Computational Nucleic Acid Coding and Feature Analysis
Field of the Invention 

The present invention is in the field of bioinformatics, particularly as it pertains to gene 
prediction. More specifically, the invention relates to the probabilistic analysis of nucleic acid 
sequences for the determination of coding features, including determination of state probabilities
for each nucleotide in a nucleic acid sequence, determination of coding strand, determination of 
open reading frame extent, determination of insertion and deletion location, determination of 
exon location, and determination of protein sequence. 

Background of the Invention

Advances in techniques for sequencing long stretches of genomic deoxyribonucleic acid 
(DNA) have allowed investigators to collect vast nucleic acid sequence data rapidly. These 
advances, combined with initiatives to sequence the entire human genome and the genomes of 
several other species, have created a need for the rapid identification of genes on long stretches
of sequenced DNA. Conventional gene location techniques, such as cDNA hybridization, are
effective at locating transcribed genes, but are time-consuming and costly. 

An alternative for locating genes on DNA that has not otherwise been analyzed for 
potential coding regions involves using statistical detection methods. Such methods 
conventionally include using probability models to predict where in a DNA sequence a gene is
located. The theoretical nucleic acid sequence probabilities can be determined through analysis
of known coding regions in the organism of interest. Once theoretical nucleic acid sequence 
probabilities are determined, nucleic acid sequences in unannotated regions of DNA in the same 
or a similar organism can be statistically compared to the theoretical nucleic acid sequence 
probabilities. If the similarity is sufficient, the investigator is notified that a coding sequence
exists. Conventional cloning techniques can then be used to isolate the putative gene and check
for transcription. 

One type of statistical detection method searches DNA by content. In such content- 
based models, highly conserved regions of DNA that are common to all genes are located. If a 
conserved region of DNA is found, then the nucleic acid sequence associated with the conserved 
region can be compared with known genes. Such comparisons, which can be done with nucleic
acid sequence comparison programs such as BLAST, are inefficient to run, however, and
content-based searches therefore have limited desirability. 

A second type of statistical detection method searches DNA by signal. This type of 
searching involves using probability models to predict whether DNA fragments within a larger 
nucleic acid sequence are coding. Early search-by-signal programs, such as TestCode and
Grail, relied on statistical variations within coding regions of DNA, including codon frequency, 
local nucleic acid sequence composition, codon preference measures, heuristics based on 
oligonucleotide frequency variations, and measures of nucleic acid sequence complexity. 

Beyond simple gene detection, there is also a need for the determination of other coding 
features, such as the location of intron/exon boundaries in eukaryotic organisms and the location
of insertions or deletions. The program GENSCAN (Burge, C. and Karlin, S. (1997) Prediction 
of Complete Gene Structures in Human Genomic DNA. J. Mol. Biol. 268, 78-94), for example, 
predicts exon location with local state probabilities based on oligonucleotide usage. GENSCAN, 
however, also depends on non-local nucleic acid sequence characteristics, which make the 
program very sensitive to sequencing errors and genes containing alternative splicing strategies.

One statistical model that avoids the problems caused by dependence on non-local
nucleic acid sequence characteristics is the inhomogeneous Markov model. An inhomogeneous 
Markov model depends upon local probabilities, and is therefore not sensitive to sequencing
errors or genes with alternative splicing strategies. The inhomogeneous Markov model is
"inhomogeneous" because it determines the state probabilities for a given nucleotide in multiple
reading frames rather than in a single reading frame. GeneMark, for example, is a computer 
program that uses the inhomogeneous Markov model to locate genes. 
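
By way of illustration only, the reading-frame dependence described above can be sketched as a first-order Markov chain whose transition table changes with codon position. The function name, the table layout, and the toy probabilities below are assumptions made for this sketch; they are not GeneMark parameters or the GeneMark implementation.

```python
import math

BASES = "ACGT"

def frame_log_prob(seq, initial, transitions, frame):
    """Log-probability of seq under a first-order, 3-periodic Markov chain.

    initial:     dict base -> probability of the first base
    transitions: list of three dicts, one per codon position; each maps
                 (previous_base, base) -> probability
    frame:       codon position (0, 1, or 2) assigned to seq[0]
    """
    logp = math.log(initial[seq[0]])
    for i in range(1, len(seq)):
        pos = (frame + i) % 3  # codon position of seq[i] in this frame
        logp += math.log(transitions[pos][(seq[i - 1], seq[i])])
    return logp

# Toy tables (assumed values): uniform, except codon position 2 favors G.
uniform = {(a, b): 0.25 for a in BASES for b in BASES}
g_rich = {(a, b): (0.7 if b == "G" else 0.1) for a in BASES for b in BASES}
transitions = [uniform, uniform, g_rich]
initial = {b: 0.25 for b in BASES}

seq = "ATGATGATG"
scores = [frame_log_prob(seq, initial, transitions, f) for f in range(3)]
best_frame = max(range(3), key=lambda f: scores[f])
```

Because every term depends only on a nucleotide and its immediate predecessor, a sequencing error in this sketch changes only the terms next to it, which is the locality property noted above.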

The GeneMark gene prediction algorithm was developed in several steps. A series of
three publications demonstrated that inhomogeneous Markov models were useful tools for gene
prediction (see Borodovsky, M., Sprizhitsky Yu., Golovanov E. and Alexandrov A. (1986)
Statistical Patterns in Primary Structures of Functional Regions in the E. coli Genome: I.
Oligonucleotide Frequencies Analysis, Molecular Biology, 20, 826-833; Borodovsky, M.,
Sprizhitsky Yu., Golovanov E. and Alexandrov A. (1986) Statistical Patterns in Primary
Structures of Functional Regions in the E. coli Genome: II. Non-homogeneous Markov Models,
Molecular Biology, 20, 833-840; and Borodovsky, M., Sprizhitsky Yu., Golovanov E. and
Alexandrov A. (1986) Statistical Patterns in Primary Structures of Functional Regions in the E.
coli Genome: III. Computer Recognition of Coding Regions, Molecular Biology, 20, 1145-1150,
all of which are herein incorporated by reference in their entirety). The GeneMark method was
based on an inhomogeneous Markov model and was described in 1993 (see Borodovsky, M.
and McIninch J. (1993) GeneMark, Parallel Gene Recognition for both DNA Strands, Computers
& Chemistry, 17, 123-133, and Borodovsky, M. and McIninch J. (1993) BioSystems v30, pp.
161-171, both of which are herein incorporated by reference in their entirety). The capabilities of
the GeneMark program were subsequently investigated (see James D. McIninch, Prediction of
Protein Coding Regions in Unannotated DNA sequences Using an Inhomogeneous Markov
Model of Genetic Information Encoding (1997) (Ph.D. dissertation, Georgia Institute of
Technology, on file with the Georgia Institute of Technology Library), which is herein
incorporated by reference in its entirety).

Conventional programs using inhomogeneous Markov models, however, are limited to a
defined probabilistic model for determining probability, and cannot be tailored by the
investigator to better suit the nucleic acid sequence under study if information about that nucleic
acid sequence is already available. Further, conventional implementations do not allow for the
efficient and accurate detection of other nucleic acid sequence features.

What is needed in the art is a method of determining state probabilities for a nucleic acid
sequence having some known characteristics, where the method is insensitive to frameshift
insertions or deletions, and compatible methods for detecting other nucleic acid sequence
features in known or unknown nucleic acid sequences.

Summary of the Invention

The present invention relates to the probabilistic analysis of nucleic acid sequences for 
the determination of coding features, including determination of state probabilities for each
nucleotide in a nucleic acid sequence, determination of coding strand, determination of open 
reading frame extent, determination of insertion and deletion location, determination of exon 
location, and determination of protein sequence. Described herein are methods, devices, and 
systems for analyzing the information content in nucleic acids. 



The present invention includes and provides a method for determining a probability for 
one or more states for a nucleotide in a nucleic acid sequence, comprising: a) determining an 
initial oligonucleotide probability for each of the states for an initial oligonucleotide in the 
nucleic acid sequence; b) determining transition probabilities for each of the states for 
nucleotides within the nucleic acid sequence following the initial oligonucleotide; c) determining
a probability for the nucleic acid sequence for each of the states; and, d) determining a 
probability for each of the states for the nucleotide based upon the probability of the nucleic acid 
sequence and a bias. 
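
Steps a) through d) can be sketched as follows, under the assumption that the bias acts as a prior weight on each state and is combined with the sequence probability by Bayes' rule. The function name, the dictionary layout of the models, and the two toy states are hypothetical and chosen only for illustration.

```python
import math

def state_posteriors(seq, models, bias, order=2):
    """Return P(state | seq) for each state via Bayes' rule.

    models: dict state -> (initial, transition), where initial maps an
            oligonucleotide of length `order` to a probability and
            transition maps (preceding oligonucleotide, next base)
            to a probability
    bias:   dict state -> prior weight for that state (must be positive)
    """
    log_joint = {}
    for state, (initial, transition) in models.items():
        # a) probability of the initial oligonucleotide under this state
        logp = math.log(initial[seq[:order]])
        # b) transition probability of each following nucleotide
        for i in range(order, len(seq)):
            logp += math.log(transition[(seq[i - order:i], seq[i])])
        # c) logp is now log P(seq | state); d) weight it by the bias
        log_joint[state] = logp + math.log(bias[state])
    # normalize in log space to avoid underflow on long sequences
    top = max(log_joint.values())
    weights = {s: math.exp(v - top) for s, v in log_joint.items()}
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}

# Toy first-order models (assumed values): the coding state favors G.
BASES = "ACGT"
g_rich = ({b: 0.25 for b in BASES},
          {(a, b): (0.7 if b == "G" else 0.1) for a in BASES for b in BASES})
uniform = ({b: 0.25 for b in BASES},
           {(a, b): 0.25 for a in BASES for b in BASES})
models = {"coding": g_rich, "noncoding": uniform}

even = state_posteriors("AGG", models, {"coding": 0.5, "noncoding": 0.5}, order=1)
skewed = state_posteriors("AGG", models, {"coding": 0.1, "noncoding": 0.9}, order=1)
```

In this toy case the same sequence is called coding under an even bias and noncoding under a bias that strongly favors the noncoding state, which illustrates how prior knowledge about a sequence could be supplied through the bias.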

The present invention includes and provides a method for determining a probability for 
one or more states for a nucleotide in a nucleic acid sequence, comprising: a) determining an
initial oligonucleotide probability for each of the states for an initial oligonucleotide in the 
nucleic acid sequence; b) determining transition probabilities for each of the states for 
nucleotides within the nucleic acid sequence following the initial oligonucleotide; c) determining 
a probability for the nucleic acid sequence for each of the states; and, d) determining a 
probability for each of the states for the nucleotide based upon the probability of the nucleic acid
sequence, wherein the determining a probability for each of the states is capable of accepting a 
bias. 

The present invention includes and provides a method for determining a probability for 
each of one or more states for more than one nucleotide in a nucleic acid sequence, comprising: a)
determining an initial oligonucleotide probability for each of the states for an initial
oligonucleotide in a window of a first nucleotide; b) determining transition probabilities for each
of the states for nucleotides within the window following the initial oligonucleotide; c) 
determining a probability for the window for each of the states; d) determining a probability for 
each of the states for the nucleotide based upon the probability for the window and a bias; and, e)
repeating steps a) through d) for each remaining nucleotide in the nucleic acid sequence.
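
The per-nucleotide windowed procedure can be sketched as below. The sketch assumes a window centered on each nucleotide and clipped at the sequence ends, and it takes the window likelihood as a caller-supplied function; the stand-in likelihood shown (independent bases) is a simplification used only to keep the example short, and all names are hypothetical.

```python
import math

def windowed_posteriors(seq, log_likelihood, bias, half_width=3):
    """For each position i, score the window around i under every state.

    log_likelihood: callable (window, state) -> log P(window | state)
    bias:           dict state -> prior weight (must be positive)
    Returns a list with one dict of state probabilities per nucleotide.
    """
    results = []
    for i in range(len(seq)):
        # window centered on i, clipped at the sequence ends (an assumption)
        window = seq[max(0, i - half_width): i + half_width + 1]
        log_joint = {s: log_likelihood(window, s) + math.log(bias[s])
                     for s in bias}
        top = max(log_joint.values())
        weights = {s: math.exp(v - top) for s, v in log_joint.items()}
        total = sum(weights.values())
        results.append({s: w / total for s, w in weights.items()})
    return results

# Stand-in likelihood (assumed): independent bases, GC-rich when coding.
BASE_PROBS = {
    "coding": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "noncoding": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
}

def toy_log_likelihood(window, state):
    return sum(math.log(BASE_PROBS[state][b]) for b in window)

profile = windowed_posteriors("ATATATGCGCGCGC", toy_log_likelihood,
                              {"coding": 0.5, "noncoding": 0.5})
```

The result is a probability profile with one entry per nucleotide, which is the form of input assumed by the strand, open reading frame, and exon sketches in this section.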

The present invention includes and provides a method for determining strand coding of a 
nucleic acid sequence based upon a bias, comprising: a) determining a probability of each of one 
or more states for each nucleotide in the nucleic acid sequence, wherein each of the states is 
either a positive strand state or a negative strand state; b) summing the probabilities of the
positive strand states for each of the nucleotides to produce a sum of probabilities for positive
states; c) summing the probabilities of the negative strand states for each of the nucleotides to
produce a sum of probabilities for negative states; and, d) deciding one of i) coding is mixed or 
not detectable if a first function of the sum of probabilities for positive states and the sum of 
probabilities for negative states is less than a threshold value; ii) coding is on the positive strand
if a second function of the sum of probabilities for positive states is greater than a third function
of the sum of probabilities for negative states and the first function is not less than the threshold 
value; and iii) coding is on the negative strand if the second function of the sum of probabilities 
for positive states is not greater than the third function of the sum of probabilities for negative 
states and the first function is not less than the threshold value. 
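
The three-way strand decision can be sketched as follows. The text does not fix the first, second, and third functions, so the defaults here (a plain sum for the first, the identity for the second and third) are assumptions, as are the state labels and the threshold value.

```python
def decide_strand(profile, positive_states, negative_states, threshold,
                  first=lambda pos, neg: pos + neg,
                  second=lambda pos: pos,
                  third=lambda neg: neg):
    """Classify strand coding from a per-nucleotide probability profile.

    profile: list of dicts, one per nucleotide, mapping state -> probability
    first, second, third: the three functions named in step d); the
    defaults are illustrative assumptions only.
    """
    # b) and c): sum the positive and negative strand state probabilities
    pos_sum = sum(p[s] for p in profile for s in positive_states)
    neg_sum = sum(p[s] for p in profile for s in negative_states)
    # d) i): too little coding signal on either strand
    if first(pos_sum, neg_sum) < threshold:
        return "mixed or not detectable"
    # d) ii) and iii): compare the two strands
    if second(pos_sum) > third(neg_sum):
        return "positive"
    return "negative"

# Hypothetical state labels: "+1" and "-1" are coding states, "non" is not.
strong = [{"+1": 0.8, "-1": 0.1, "non": 0.1}] * 4
weak = [{"+1": 0.05, "-1": 0.05, "non": 0.9}] * 4

strong_call = decide_strand(strong, ["+1"], ["-1"], threshold=1.0)
weak_call = decide_strand(weak, ["+1"], ["-1"], threshold=1.0)
```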

The present invention includes and provides a method for determining the extent of an
open reading frame within a nucleic acid sequence based upon a bias, comprising: a) determining
the probability of each of one or more states for each nucleotide in the nucleic acid sequence, 
wherein each of the states is either a coding state or a noncoding state; b) determining the coding 
strand of the nucleic acid sequence; and, c) determining the points within the nucleic acid
sequence in the coding strand at which the sum of the probabilities of the coding states for each
nucleotide drops below a first threshold value for a number of nucleotides greater than a second 
threshold value, wherein ends of the open reading frame are indicated at the points. 
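
One reading of step c) is sketched below: a region stays open through short dips and is closed only when the summed coding probability stays below the first threshold for more than the second threshold's number of nucleotides. The function name, the half-open index convention, and the example values are assumptions for illustration.

```python
def orf_extents(coding_sums, prob_threshold, run_threshold):
    """Return (start, end) index pairs, end exclusive, of putative ORFs.

    coding_sums:    per-nucleotide sum of coding-state probabilities
    prob_threshold: the first threshold (probability level)
    run_threshold:  the second threshold (a dip must last longer than
                    this many nucleotides to end the open reading frame)
    """
    extents = []
    start = None   # index where the current region began
    low_run = 0    # length of the current below-threshold run
    for i, value in enumerate(coding_sums):
        if value >= prob_threshold:
            if start is None:
                start = i
            low_run = 0
        elif start is not None:
            low_run += 1
            if low_run > run_threshold:
                # the region ended where the dip began
                extents.append((start, i - low_run + 1))
                start, low_run = None, 0
    if start is not None:
        # sequence ended inside a region; drop any trailing short dip
        extents.append((start, len(coding_sums) - low_run))
    return extents

sums = [0.1, 0.1, 0.9, 0.9, 0.2, 0.9, 0.9, 0.1, 0.1, 0.1, 0.9]
extents = orf_extents(sums, prob_threshold=0.5, run_threshold=2)
```

In the example the single low value at index 4 is tolerated, while the three consecutive low values starting at index 7 close the first region.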

The present invention includes and provides a method for determining the location of 
insertions and deletions within a nucleic acid sequence, comprising: a) determining the
probability of each of one or more states for each nucleotide in the nucleic acid sequence based
upon a bias, wherein each of the states is either a coding state or a noncoding state; b) setting a 
length for a window; c) determining which state has a maximum mean probability for the nucleic 
acid sequence on a first side of a middle nucleotide in the window, wherein the window begins at 
a first nucleotide; d) determining which state has a maximum mean probability for the nucleic
acid sequence on a second side of the middle nucleotide in the window; e) determining that a
deletion or insertion occurred at the middle nucleotide if i) the state with the maximum mean 
probability on the first side of the middle nucleotide is different from the state with the maximum 
mean probability on the second side of the middle nucleotide, and ii) either an average of
hypothetical state probabilities for the window with an insertion at the middle nucleotide or an
average of hypothetical state probabilities for the window with a deletion at the middle
nucleotide is greater than a sum of the middle nucleotide's coding state probabilities; and, f)
repeating steps c) through e) for each remaining nucleotide in the nucleic acid sequence after the 
first nucleotide, wherein the window begins at each remaining nucleotide in turn. 
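
The window scan of steps b) through f) can be partly sketched as below. The sketch implements condition i) only, the change in the state with the maximum mean probability across the middle nucleotide; condition ii) requires recomputing state probabilities for hypothetical edited windows and is deliberately omitted. All names and the toy two-frame profile are assumptions.

```python
def dominant_state(profile_slice):
    """State with the maximum mean probability over per-nucleotide dicts."""
    states = profile_slice[0].keys()
    count = len(profile_slice)
    return max(states, key=lambda s: sum(p[s] for p in profile_slice) / count)

def candidate_frameshifts(profile, window_length):
    """Indices of middle nucleotides that satisfy condition i).

    profile:       list of dicts, one per nucleotide, state -> probability
    window_length: window size in nucleotides (at least 3)
    """
    half = window_length // 2
    candidates = []
    # f) slide the window so that it begins at each nucleotide in turn
    for start in range(len(profile) - window_length + 1):
        middle = start + half
        left = profile[start:middle]                       # c) first side
        right = profile[middle + 1:start + window_length]  # d) second side
        if dominant_state(left) != dominant_state(right):  # e) i)
            candidates.append(middle)
    return candidates

# Toy profile: reading frame "c1" dominates, then frame "c2" takes over.
shifted = [{"c1": 0.9, "c2": 0.1}] * 6 + [{"c1": 0.1, "c2": 0.9}] * 6
candidates = candidate_frameshifts(shifted, window_length=7)
```

Condition i) alone flags a cluster of positions around the frame change; in the method as described, condition ii) would then be applied to each flagged position.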

The present invention includes and provides a method for determining exon location
within a nucleic acid sequence, comprising: a) determining the probability of each of one or more
states for each nucleotide in the nucleic acid sequence based upon a bias, wherein each of the 
states is either a coding state or a noncoding state; b) determining the coding strand of the nucleic
acid sequence; c) determining the extent of an open reading frame within the nucleic acid 
sequence; d) classifying each nucleotide in a coding class or a noncoding class based on a most
probable state for the coding strand; e) reclassifying each nucleotide according to defined rules;
and, f) determining that regions of the nucleic acid sequence in the coding class are exons. 
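
Steps d) through f) can be sketched as follows. The "defined rules" of step e) are not specified in this passage, so the single rule used here (an interior run shorter than a minimum length takes the class of the run before it) is purely a hypothetical stand-in, as are the function names and labels.

```python
from itertools import groupby

def classify(profile, coding_states):
    """Step d): label each nucleotide by its most probable state."""
    labels = []
    for probs in profile:
        best = max(probs, key=probs.get)
        labels.append("coding" if best in coding_states else "noncoding")
    return labels

def absorb_short_runs(labels, min_run):
    """Step e), with one hypothetical rule: an interior run shorter than
    min_run takes the class of the run before it."""
    runs = [(label, len(list(group))) for label, group in groupby(labels)]
    out = []
    for index, (label, length) in enumerate(runs):
        interior = 0 < index < len(runs) - 1
        if interior and length < min_run:
            label = out[-1]
        out.extend([label] * length)
    return out

def exon_regions(labels):
    """Step f): (start, end) pairs, end exclusive, of coding runs."""
    regions, position = [], 0
    for label, group in groupby(labels):
        length = len(list(group))
        if label == "coding":
            regions.append((position, position + length))
        position += length
    return regions

raw = (["coding"] * 5 + ["noncoding"] * 1 + ["coding"] * 4
       + ["noncoding"] * 6 + ["coding"] * 5)
exons = exon_regions(absorb_short_runs(raw, min_run=3))
```

Here the isolated noncoding call at index 5 is absorbed into the surrounding coding run, so two regions are reported rather than three.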

The present invention includes and provides a program storage device readable by a 
machine, tangibly embodying a program of instructions executable by a machine to perform 
method steps to determine a probability for each of one or more states for a nucleotide in a
nucleic acid sequence, the method steps comprising: a) determining an initial oligonucleotide
probability for each of the states for an initial oligonucleotide in the nucleic acid sequence; b) 
determining transition probabilities for each of the states for nucleotides within the nucleic acid 
sequence following the initial oligonucleotide; c) determining a probability for the nucleic acid 
sequence for each of the states; and, d) determining a probability for each of the states for the
nucleotide based upon the probability of the nucleic acid sequence and a bias.

The present invention includes and provides a program storage device readable by a 
machine, tangibly embodying a program of instructions executable by a machine to perform 
method steps to determine a probability for one or more states for more than one nucleotide in a 
nucleic acid sequence, the method steps comprising: a) determining an initial oligonucleotide
probability for each of the states for an initial oligonucleotide in a window of a first nucleotide;
b) determining transition probabilities for each of the states for nucleotides within the window 
following the initial oligonucleotide; c) determining a probability for the window for each of the 
states; d) determining a probability for each of the states for the nucleotide based upon the 
probability for the window and a bias; and, e) repeating steps a) through d) for each remaining
nucleotide in the nucleic acid sequence.

The present invention includes and provides a program storage device readable by a
machine, tangibly embodying a program of instructions executable by a machine to perform 
method steps to determine strand coding of a nucleic acid sequence, the method steps 
comprising: a) determining a probability of each of one or more states for each nucleotide in the
nucleic acid sequence based upon a bias, wherein each of the states is either a positive strand
state or a negative strand state; b) summing the probabilities of the positive strand states for each 
of the nucleotides to produce a sum of probabilities for positive states; c) summing the 
probabilities of the negative strand states for each of the nucleotides to produce a sum of 
probabilities for negative states; and, d) deciding one of i) coding is mixed or not detectable if a
first function of the sum of probabilities for positive states and the sum of probabilities for
negative states is less than a threshold value; ii) coding is on the positive strand if a second 
function of the sum of probabilities for positive states is greater than a third function of the sum 
of probabilities for negative states and the first function is not less than the threshold value; and 
iii) coding is on the negative strand if the second function of the sum of probabilities for positive
states is not greater than the third function of the sum of probabilities for negative states and the
first function is not less than the threshold value. 

The present invention includes and provides a program storage device readable by a 
machine, tangibly embodying a program of instructions executable by a machine to perform 
method steps to determine the extent of an open reading frame within a nucleic acid sequence,
the method steps comprising: a) determining the probability of each of one or more states for
each nucleotide in the nucleic acid sequence based upon a bias, wherein each of the states is 
either a coding state or a noncoding state; b) determining the coding strand of the nucleic acid 
sequence; and, c) determining the points within the nucleic acid sequence in the coding strand at 
which the sum of the probabilities of the coding states for each nucleotide drops below a first
threshold value for a number of nucleotides greater than a second threshold value, wherein ends
of the open reading frame are indicated at the points. 

The present invention includes and provides a program storage device readable by a 
machine, tangibly embodying a program of instructions executable by a machine to perform 
method steps to determine the location of insertions and deletions within a nucleic acid sequence,
the method steps comprising: a) determining the probability of each of one or more states for
each nucleotide in the nucleic acid sequence based upon a bias, wherein each of the states is
either a coding state or a noncoding state; b) setting a length for a window; c) determining which 
state has a maximum mean probability for the nucleic acid sequence on a first side of a middle 
nucleotide in the window, wherein the window begins at a first nucleotide; d) determining which 
state has a maximum mean probability for the nucleic acid sequence on a second side of the 
middle nucleotide in the window; e) determining that a deletion or insertion occurred at the 
middle nucleotide if i) the state with the maximum mean probability on the first side of the 
middle nucleotide is different from the state with the maximum mean probability on the second 
side of the middle nucleotide, and ii) either an average of hypothetical state probabilities for the
window with an insertion at the middle nucleotide or an average of hypothetical state
probabilities for the window with a deletion at the middle nucleotide is greater than a sum of the
middle nucleotide's coding state probabilities; and, f) repeating steps c) through e) for each
remaining nucleotide in the nucleic acid sequence after the first nucleotide, wherein the window 
begins at each remaining nucleotide in turn. 

The present invention includes and provides a program storage device readable by a 
machine, tangibly embodying a program of instructions executable by a machine to perform 
method steps to determine exon location within a nucleic acid sequence, the method steps 
comprising: a) determining the probability of each of one or more states for each nucleotide in 
the nucleic acid sequence based upon a bias, wherein each of the states is either a coding state or 
a noncoding state; b) determining the coding strand of the nucleic acid sequence; c) determining
the extent of an open reading frame within the nucleic acid sequence; d) classifying each 
nucleotide in a coding class or a noncoding class based on a most probable state for the coding 
strand; e) reclassifying each nucleotide according to defined rules; and, f) determining that
regions of the nucleic acid sequence in the coding class are exons. 

The present invention includes and provides a computer system for determining a 
probability for each of one or more states for a nucleotide in a nucleic acid sequence, comprising: 
an input device for inputting the nucleic acid sequence; a memory for storing the nucleic acid 
sequence; a processing unit configured for retrieving the nucleic acid sequence and for: a) 
determining an initial oligonucleotide probability for each of the states for an initial 
oligonucleotide in the nucleic acid sequence; b) determining transition probabilities for each of 
the states for nucleotides within the nucleic acid sequence following the initial oligonucleotide;
c) determining a probability for the nucleic acid sequence for each of the states; and, d) 
determining a probability for each of the states for the nucleotide based upon the probability of 
the nucleic acid sequence and a bias. 

The present invention includes and provides a computer system for determining a 
probability for each of one or more states for more than one nucleotide in a nucleic acid 
sequence, comprising: an input device for inputting the nucleic acid sequence; a memory for 
storing the nucleic acid sequence; a processing unit configured for retrieving the nucleic acid 
sequence and for: a) determining an initial oligonucleotide probability for each of the states for 
an initial oligonucleotide in a window of a first nucleotide; b) determining transition probabilities 
for each of the states for nucleotides within the window following the initial oligonucleotide; c) 
determining a probability for the window for each of the states; d) determining a probability for 
each of the states for the nucleotide based upon the probability for the window and a bias; and, e) 
repeating steps a) through d) for each remaining nucleotide in the nucleic acid sequence.

The present invention includes and provides a computer system for determining strand
coding of a nucleic acid sequence, comprising: an input device for inputting the nucleic acid
sequence; a memory for storing the nucleic acid sequence; a processing unit configured for 
retrieving the nucleic acid sequence and for: a) determining a probability of each of one or more 
states for each nucleotide in the nucleic acid sequence based upon a bias, wherein each of the 
states is either a positive strand state or a negative strand state; b) summing the probabilities of
the positive strand states for each of the nucleotides to produce a sum of probabilities for positive 
states; c) summing the probabilities of the negative strand states for each of the nucleotides to 
produce a sum of probabilities for negative states; and, d) deciding one of i) coding is mixed or 
not detectable if a first function of the sum of probabilities for positive states and the sum of 
probabilities for negative states is less than a threshold value; ii) coding is on the positive strand 
if a second function of the sum of probabilities for positive states is greater than a third function 
of the sum of probabilities for negative states and the first function is not less than the threshold 
value; and iii) coding is on the negative strand if the second function of the sum of probabilities 
for positive states is not greater than the third function of the sum of probabilities for negative 
states and the first function is not less than the threshold value.

The present invention includes and provides a computer system for determining the
extent of an open reading frame within a nucleic acid sequence, comprising: an input device for 
inputting a nucleic acid sequence; a memory for storing the nucleic acid sequence; a processing 
unit configured for retrieving the nucleic acid sequence and for: a) determining the probability of 
each of one or more states for each nucleotide in the nucleic acid sequence based upon a bias, 
wherein each of the states is either a coding state or a noncoding state; b) determining the coding 
strand of the nucleic acid sequence; and, c) determining the points within the nucleic acid 
sequence in the coding strand at which the sum of the probabilities of the coding states for each 
nucleotide drops below a first threshold value for a number of nucleotides greater than a second 
threshold value, wherein ends of the open reading frame are indicated at the points. 

The present invention includes and provides a computer system for determining the 
location of insertions and deletions within a nucleic acid sequence, comprising: an input device 
for inputting a nucleic acid sequence; a memory for storing the nucleic acid sequence; a 
processing unit configured for retrieving the nucleic acid sequence and for: a) determining the 
probability of each of one or more states for each nucleotide in the nucleic acid sequence based 
upon a bias, wherein each of the states is either a coding state or a noncoding state; b) setting a 
length for a window; c) determining which state has a maximum mean probability for the nucleic 
acid sequence on a first side of a middle nucleotide in the window, wherein the window begins at 
a first nucleotide; d) determining which state has a maximum mean probability for the nucleic 
acid sequence on a second side of the middle nucleotide in the window; e) determining that a 
deletion or insertion occurred at the middle nucleotide if i) the state with the maximum mean 
probability on the first side of the middle nucleotide is different from the state with the maximum 
mean probability on the second side of the middle nucleotide, and ii) either an average of
hypothetical state probabilities for the window with an insertion at the middle nucleotide or an 
average of hypothetical state probabilities for the window with a deletion at the middle 
nucleotide is greater than a sum of the middle nucleotide's coding state probabilities; and, f)
repeating steps c) through e) for each remaining nucleotide in the nucleic acid sequence after the 
first nucleotide, wherein the window begins at each remaining nucleotide in turn. 

The present invention includes and provides a computer system for determining exon 
location within a nucleic acid sequence, comprising: an input device for inputting a nucleic acid 
sequence; a memory for storing the nucleic acid sequence; a processing unit configured for
retrieving the nucleic acid sequence and for: a) determining the probability of each of one or 
more states for each nucleotide in the nucleic acid sequence based upon a bias, wherein each of 
the states is either a coding state or a noncoding state; b) determining the coding strand of the
nucleic acid sequence; c) determining the extent of an open reading frame within the nucleic acid 
sequence; d) classifying each nucleotide in a coding class or a noncoding class based on a most 
probable state for the coding strand; e) reclassifying each nucleotide according to defined rules; 
and, f) determining that regions of the nucleic acid sequence in the coding class are exons. 

The present invention includes and provides a computer program product comprising a 
computer usable medium having computer program logic recorded thereon for enabling a 
processor in a computer system to determine a probability for each of one or more states for a 
nucleotide in a nucleic acid sequence, the computer program logic comprising means for 
enabling the processor to perform each of the following steps: a) determining an initial 
oligonucleotide probability for each of the states for an initial oligonucleotide in the nucleic acid 
sequence; b) determining transition probabilities for each of the states for nucleotides within the 
nucleic acid sequence following the initial oligonucleotide; c) determining a probability for the 
nucleic acid sequence for each of the states; and, d) determining a probability for each of the 
states for the nucleotide based upon the probability of the nucleic acid sequence and a bias. 

The present invention includes and provides a computer program product comprising a 
computer usable medium having computer program logic recorded thereon for enabling a 
processor in a computer system to determine a probability for each of one or more states for more 
than one nucleotide in a nucleic acid sequence, the computer program logic comprising means 
for enabling the processor to perform each of the following steps: a) determining an initial 
oligonucleotide probability for each of the states for an initial oligonucleotide in a window of a 
first nucleotide; b) determining transition probabilities for each of the states for nucleotides 
within the window following the initial oligonucleotide; c) determining a probability for the 
window for each of the states; d) determining a probability for each of the states for the 
nucleotide based upon the probability for the window and a bias; and, e) repeating steps a) 
through d) for each remaining nucleotide in the nucleic acid sequence.

The present invention includes and provides a computer program product comprising a
computer usable medium having computer program logic recorded thereon for enabling a 
processor in a computer system to determine strand coding of a nucleic acid sequence, the 
computer program logic comprising means for enabling the processor to perform each of the 
following steps: a) determining a probability of each of one or more states for each nucleotide in
the nucleic acid sequence based upon a bias, wherein each of the states is either a positive strand 
state or a negative strand state; b) summing the probabilities of the positive strand states for each 
of the nucleotides to produce a sum of probabilities for positive states; c) summing the 
probabilities of the negative strand states for each of the nucleotides to produce a sum of 
probabilities for negative states; and, d) deciding one of i) coding is mixed or not detectable if a
first function of the sum of probabilities for positive states and the sum of probabilities for 
negative states is less than a threshold value; ii) coding is on the positive strand if a second 
function of the sum of probabilities for positive states is greater than a third function of the sum 
of probabilities for negative states and the first function is not less than the threshold value; and 
iii) coding is on the negative strand if the second function of the sum of probabilities for positive
states is not greater than the third function of the sum of probabilities for negative states and the 
first function is not less than the threshold value. 
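
The strand decision in steps b) through d) can be sketched as follows. This is only an illustration under stated assumptions: the text leaves the first, second, and third functions unspecified, so the sketch assumes the first function is the absolute difference of the two sums and the second and third functions are identities. The names `decide_strand` and `threshold` are hypothetical.

```python
POSITIVE = ("1+", "2+", "3+", "N+")
NEGATIVE = ("1-", "2-", "3-", "N-")

def decide_strand(state_probs, threshold):
    """state_probs: one dict per nucleotide, mapping state name -> probability."""
    pos = sum(p[s] for p in state_probs for s in POSITIVE)
    neg = sum(p[s] for p in state_probs for s in NEGATIVE)
    # Assumed "first function": absolute difference of the two sums.
    if abs(pos - neg) < threshold:
        return "mixed or not detectable"
    # Assumed "second" and "third" functions: identity.
    return "positive" if pos > neg else "negative"
```

Other choices for the three functions (for example, ratios or normalized sums) would fit the same structure.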

The present invention includes and provides a computer program product comprising a 
computer usable medium having computer program logic recorded thereon for enabling a 
processor in a computer system to determine the extent of an open reading frame within a nucleic
acid sequence, the computer program logic comprising means for enabling the processor to 
perform each of the following steps: a) determining the probability of each of one or more states 
for each nucleotide in the nucleic acid sequence based upon a bias, wherein each of the states is 
either a coding state or a noncoding state; b) determining the coding strand of the nucleic acid 
sequence; and, c) determining the points within the nucleic acid sequence in the coding strand at
which the sum of the probabilities of the coding states for each nucleotide drops below a first 
threshold value for a number of nucleotides greater than a second threshold value, wherein ends 
of the open reading frame are indicated at the points. 
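
Step c) can be sketched as a scan for sustained drops in coding probability. This is a minimal sketch, assuming the per-nucleotide sums of coding-state probabilities on the coding strand are already available as a list. The function name `find_drop_points` and its parameter names are hypothetical, and the sketch reports only where a qualifying low run begins.

```python
def find_drop_points(coding_sums, prob_threshold, run_threshold):
    """Return 0-based indices where the coding probability falls below
    prob_threshold and stays below it for more than run_threshold nucleotides."""
    points = []
    run_start = None
    for i, p in enumerate(coding_sums):
        if p < prob_threshold:
            if run_start is None:
                run_start = i          # a low run begins here
        else:
            if run_start is not None and i - run_start > run_threshold:
                points.append(run_start)
            run_start = None
    # A low run that reaches the end of the sequence is also checked.
    if run_start is not None and len(coding_sums) - run_start > run_threshold:
        points.append(run_start)
    return points
```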

The present invention includes and provides a computer program product comprising a 
computer usable medium having computer program logic recorded thereon for enabling a
processor in a computer system to determine the location of insertions and deletions within a
nucleic acid sequence, the computer program logic comprising means for enabling the processor 
to perform each of the following steps: a) determining the probability of each of one or more 
states for each nucleotide in the nucleic acid sequence based upon a bias, wherein each of the 
states is either a coding state or a noncoding state; b) setting a length for a window; c)
determining which state has a maximum mean probability for the nucleic acid sequence on a first
side of a middle nucleotide in the window, wherein the window begins at a first nucleotide; d) 
determining which state has a maximum mean probability for the nucleic acid sequence on a 
second side of the middle nucleotide in the window; e) determining that a deletion or insertion 
occurred at the middle nucleotide if i) the state with the maximum mean probability on the first
side of the middle nucleotide is different from the state with the maximum mean probability on
the second side of the middle nucleotide, and ii) either an average of hypothetical state probabilities
for the window with an insertion at the middle nucleotide or an average of hypothetical state 
probabilities for the window with a deletion at the middle nucleotide is greater than a sum of the 
middle nucleotide's coding state probabilities; and, f) repeating steps c) through e) for each
remaining nucleotide in the nucleic acid sequence after the first nucleotide, wherein the window
begins at each remaining nucleotide in turn. 

The present invention includes and provides a computer program product comprising a 
computer usable medium having computer program logic recorded thereon for enabling a 
processor in a computer system to determine exon location within a nucleic acid sequence, the
computer program logic comprising means for enabling the processor to perform each of the 
following steps: a) determining the probability of each of one or more states for each nucleotide 
in the nucleic acid sequence based upon a bias, wherein each of the states is either a coding state 
or noncoding state; b) determining the coding strand of the nucleic acid sequence; c) determining 
the extent of an open reading frame within the nucleic acid sequence; d) classifying each
nucleotide in a coding class or a noncoding class based on a most probable state for the coding
strand; e) reclassifying each nucleotide according to defined rules; and, f) determining that 
regions of the nucleic acid sequence in the coding class are exons. 

The present invention includes and provides a method for determining a probability for 
one or more states for a nucleotide in a nucleic acid sequence, comprising determining a
probability for each of the states for the nucleotide based upon a probability of the nucleic acid
sequence and a bias.

The present invention includes and provides a method for determining a probability for 
each of one or more states for more than one nucleotide in a nucleic acid sequence comprising: a) 
determining a probability for each of the states for a first nucleotide in the nucleic acid sequence
based upon a probability of a window in which the first nucleotide is located and a bias; and, b) 
repeating step a) for the remaining nucleotides in the nucleic acid sequence. 

Description Of The Figures 

Figure 1 is a flow chart representing one embodiment of a method for determining the
probability of each of the possible states for a single nucleotide in a nucleic acid sequence;

Figure 2 is a flow chart representing one embodiment of a method for determining the 
probability of each of the possible states for multiple nucleotides in a nucleic acid sequence;

Figure 3 is a flow chart representing one embodiment of a method for determining the
coding strand of a nucleic acid sequence;

Figure 4 is a flow chart representing one embodiment of a method for determining the 
extent of an open reading frame within a nucleic acid sequence; 

Figure 5 is a flow chart representing one embodiment of a method for determining the 
location of insertions and deletions within a nucleic acid sequence; 

Figure 6 is a flow chart representing one embodiment of a method for determining the
extent of exons within a nucleic acid sequence and the protein translation of those exons;

Figure 7 is a flow chart representing one embodiment of a method for determining the 
extent of exons within a nucleic acid sequence and the protein translation of those exons; 

Figure 8a is a schematic representation of a window located at the end of a nucleic acid 
sequence;

Figure 8b is a schematic representation of a window located at the end of a nucleic acid 
sequence showing nucleotides near the end of the nucleic acid sequence; 

Figure 8c is a schematic representation showing the ends of a nucleic acid sequence being 
copied to form a hypothetical extension on each end of the nucleic acid sequence; 

Figure 8d is a schematic representation of a nucleic acid sequence showing the appended
hypothetical extensions;

Figure 9a is a schematic representation of one embodiment of a computer system that can 
implement the methods of the present invention; 

Figure 9b is a schematic representation of one embodiment of a computer system that can
implement the methods of the present invention;

Figure 10a is a schematic representation of a genomic sequence of DNA with an
expressed sequence tag aligned thereto;

Figure 10b is a schematic representation of a window in a region of DNA when the entire 
region is in a known coding region; and,

Figure 10c is a schematic representation of a window in a region of DNA when part of 
the region is known to be coding, and part of the region is known to be noncoding. 

Detailed Description Of The Invention 

Described herein are methods for determining the state probabilities of one or more
nucleotides in a nucleic acid sequence, the coding strand of a nucleic acid sequence, the extent of
an open reading frame in a nucleic acid sequence, the location of deletions and insertions in a 
nucleic acid sequence, the location of exons in a nucleic acid sequence, and the translation of 
those exons. Also described are program storage devices readable by a machine, tangibly
embodying a program of instructions executable by a machine to perform the above methods.
Also described are computer systems for implementing the above methods, comprising an input 
device for inputting a nucleic acid sequence, a memory for storing the nucleic acid sequence, 
and a processing unit. Also described are computer program products comprising a computer 
usable medium having computer program logic recorded thereon for enabling a processor in a
computer system to perform the above methods.

Definitions:

Nucleic Acid Sequence - As used herein, "nucleic acid sequence" includes a nucleic acid 
sequence of any nucleic acid as is generally understood in the art. The nucleic acid can be DNA, 
cDNA, genomic DNA, raw DNA, expressed sequence tags (ESTs), RNA, mRNA,
unprocessed RNA, processed RNA, or any other form of nucleic acid, regardless of whether or
not the nucleic acid actually codes for a protein. 

Nucleic acid sequences can be derived from any natural or artificial source, including 
prokaryotic and eukaryotic organisms, and can be at any stage of processing. 

It is understood by those skilled in the art that any representation of a nucleic acid 
sequence is contemplated herein and within the scope of the present invention. That is, while 
conventionally nucleic acid sequences are represented by the nucleotide or base letters A, T, G, 
C, U, any alphanumeric or other representation of nucleotide or base nucleic acid sequence, 
whether digitally represented or otherwise, is within the scope of this invention. Further, nucleic 
acid sequence notation indicating uncertainty with respect to the identification of one or more 
bases in a nucleic acid sequence, for example IUB nomenclature such as R = G or A, Y = T or
C, etc., can be incorporated into the method described herein and is within the scope of this
invention. 

Nucleic acid sequences having modified or non-standard bases can be incorporated into 
the method described herein and are within the scope of this invention. For the purposes of this 
invention, a nucleic acid sequence of "bases" is an equivalent nucleic acid sequence to the 
nucleic acid sequence in which the bases are found. 

Reading frame - A "reading frame" is one of the possible phases in which one can read a
sequence of codons (groups of three nucleotides) that can make up a coding region of DNA or
RNA. In a codon the positions in 5' to 3' order are called the "first", "second", and "third"
reading frames.

States - The "states" attributable to a nucleotide are the potential permutations of all of the 
possible reading frames and the two nucleic acid strands included in the probability model being 
used. A "+" is used to indicate the positive strand, and "-" to indicate the reverse complement
DNA strand. In a preferred embodiment, the possible states of any one nucleotide are positive 
strand first reading frame (1+), positive strand second reading frame (2+), positive strand third 
reading frame (3+), negative strand first reading frame (1-), negative strand second reading frame 
(2-), negative strand third reading frame (3-), positive strand noncoding (N+), and negative 
strand noncoding (N-). In another embodiment, the states can be, for example, just the four
positive states listed above. Stated symbolically, "f" is an element of the set of states, i.e. f ∈
{1+, 2+, 3+, N+, 1-, 2-, 3-, N-}.

Coding State - A "coding state" is any of the states 1+, 2+, 3+, 1-, 2-, or 3-, which indicate 
coding, i.e. nucleic acids translated into protein. 

Noncoding state - A "noncoding state" is either of the states N- or N+, both of which indicate 
noncoding, i.e. no protein translation. 

Sequentially - "Sequentially" means performing a step or series of steps on nucleotides in order 
as the nucleotides occur in the nucleic acid sequence, in either direction. 

State probabilities - The "state probabilities" of a nucleotide within a nucleic acid sequence are a 
vector of probabilities associated with the given nucleotide being in each of the states. 

Window - A "window" is a contiguous and defined number of nucleotides within a nucleic acid 
sequence. For example, in a nucleic acid sequence having a length of several thousand 
nucleotides, a window of, again for example, 100 nucleotides can be defined for specific analysis 
at any place within the larger nucleic acid sequence. 

Middle Nucleotide - The "middle nucleotide" in any given nucleic acid sequence or window is 
the nucleotide found at the numerical middle of the nucleic acid sequence or window, 
respectively, wherein the length of a nucleic acid sequence or window is the total number of 
nucleotides in the nucleic acid sequence or window. If the nucleic acid sequence or window has 
an even number of nucleotides, then the middle nucleotide can be either of the two nucleotides 
adjacent to the numerical middle of the nucleic acid sequence or window. For example, the middle
nucleotide in a 101 nucleotide long window is nucleotide number 51, and the middle nucleotide 
in a 100 nucleotide long window can be either nucleotide number 50 or nucleotide number 51. 
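
The definition above can be illustrated with a small helper. This is a minimal sketch; the function name `middle_nucleotide_numbers` is hypothetical, and positions are numbered from 1 as in the examples in the text.

```python
def middle_nucleotide_numbers(length):
    """Return the 1-based position(s) that qualify as the middle nucleotide."""
    if length < 1:
        raise ValueError("length must be positive")
    if length % 2 == 1:
        # Odd length: a single numerical middle.
        return [(length + 1) // 2]
    # Even length: either of the two nucleotides adjacent to the middle.
    return [length // 2, length // 2 + 1]
```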

Oligonucleotide - An "oligonucleotide" is a series of contiguous nucleotides with a defined
length.

Initial Oligonucleotide - The "initial oligonucleotide" is the oligonucleotide that occurs at the
beginning of the nucleic acid sequence or window being examined. Therefore, the first
nucleotide in the initial oligonucleotide is also the first nucleotide in the sequence or window.

Transition Probability - A "transition probability" for a given nucleotide is the probability of the 
nucleotide occurring given the oligonucleotide immediately preceding that nucleotide. 

Bias Function - The "Bias Function" is a function that is used to differentially alter the
probability of one or more states of one or more nucleotides in a nucleic acid sequence. For
example, if a region of the nucleic acid sequence under study is thought to be a coding region,
then the bias function can be used to increase the calculated probability of the coding states for
that nucleic acid sequence.

Bias - "Bias" is a set of one or more values that are used in the Bias Function, and is used to alter 
the probability of one or more states of one or more nucleotides in a nucleic acid sequence. 

Filter - A "filter" as used herein is any method or algorithm for unifying and making more
homogeneous regions of a nucleic acid sequence that have been classified in disparate states. A
filter is used for the purpose of more clearly defining coding region boundaries in a nucleic acid 
sequence. In a method, a step in which a filter is applied is a "filtering step." 

Class - A "class" of nucleotides is a group of nucleotides that are designated as having one state
for the purposes of filtering. 

Positive Strand and Negative Strand - The terms "positive strand (+)" and "negative strand (-)" 
represent complementary nucleic acid sequences. The sequence in one strand is defined by the 
sequence in the complementary strand.

Positive Strand State - A "positive strand state" is any of states 1+, 2+, 3+, N+.

Negative Strand State - A "negative strand state" is any of states 1-, 2-, 3-, N-.

Description

The methods described herein can be performed in any manner that allows for the 
analysis of the nucleic acid sequence under study and computation of the probabilities associated 
with that nucleic acid sequence. In a preferred embodiment, the physical nucleic acid sequence, 
for example a DNA sequence having a contiguous nucleic acid sequence of G, C, T, and A 
nucleotides, is converted into digital form by, for example, inputting the nucleic acid sequence 
into a computer system. The computer then processes the nucleic acid sequence using the 
methods described herein. Any nucleic acid sequence referred to herein can be arranged to have 
a beginning and an end, and numbered so that the first nucleotide in the nucleic acid sequence is 
number 1, the next nucleotide in the nucleic acid sequence is number 2, and so on until the end of 
the nucleic acid sequence. Any other numbering scheme that is useful can be used. 

The methods shown in Figures 1-7 are independent, and, although several of the methods 
described can be utilized together, they can each be performed as independent methods. Further, 
where one method calls for a step in which one of the other methods can be used for that step, the 
use of the other method in the step represents only one embodiment, and other methods for 
performing the step can be used as well. 

Any probability model applicable to nucleic acid sequence state probabilities can be used 
for the probability steps if the output of the probability model sufficiently supports the method, 
including inhomogeneous Markov models that have fewer than eight states, for example, those 
having only six or four states. In a preferred embodiment, the inhomogeneous Markov model 
has eight states. (For a general discussion of various models, see Durbin, et al., Biological 
Sequence Analysis (1998), which is herein incorporated by reference in its entirety). 

Any nucleic acid sequence source can be used, regardless of the accuracy of the nucleic 
acid sequence relative to the physical molecule it represents, including raw nucleic acid sequence 
data and nucleic acid sequence data that has been changed or adjusted for other purposes, such as 
nucleic acid sequences that have been filtered to improve accuracy, nucleic acid sequences that
have been altered to account for known mutations, and nucleic acid sequences that have been 
engineered in any manner whatsoever, among others. Nucleic acid sequence information 
produced by automated nucleic acid sequencers can be used, as well as nucleic acid sequence 
information derived by any conventional sequencing technique, such as dideoxy sequencing, 
among others. Nucleic acid sequences produced by or from other bioinformatic processing 
methods or nucleic acid databases can be used, for example, including nucleic acid sequences 
stored in public access databases such as GenBank. Although nucleic acid sequences with any 
amount of error can be used, in a preferred embodiment the amount of sequencing error present 
is less than about 15%, and more preferably is less than about 10%. However, an advantage of 
the methods of the present invention is that they can utilize lower quality nucleic acid sequences. 
In this embodiment, the methods of the present invention can utilize nucleic acid sequences 
where the average sequence accuracy is less than 99%, more preferably less than 95%, more
preferably less than 90%, 80%, or 70%.

The present invention includes the incorporation of bias into probability models that 
determine state probabilities for one or more nucleotides. The bias is used to alter the statistical 
probability of one or more states for a nucleotide. A bias of zero, for example, will reduce the 
probability of a state to zero, while a bias of one will not alter the statistical probability. Values 
greater than one will increase the statistical probability of a state, while values between zero and 
one will reduce the statistical probability of a state. Bias can be defined by the investigator in 
order to influence the probability of states. In a preferred embodiment, bias is defined to alter the 
probability of states in a manner consistent with existing knowledge of the nucleic acid sequence 
under study. For example, if a nucleic acid sequence has a region that is strongly suspected to be 
coding, then the nucleotides in that region can be assigned a large bias for the coding states, and 
a small bias for the noncoding states. Bias can be incorporated into any conventional statistical 
model that provides a method for determining state probabilities in order to allow for the biasing 
of statistical probabilities in that model. In one embodiment, bias can be defined for each state as 
a number equal to or greater than zero, excluding one. In this embodiment, the statistical
probability of a state will be reduced if the bias is set to a number equal to or greater than zero
and less than one, and increased if the bias is set to a number greater than one, and all states are
biased in one direction or the other. In another embodiment, bias can be defined as one for one or
more states, and a number other than one for one or more states. In this embodiment, one or 
more states has a defined bias of one, which results in no biasing of the probability of that state, 
while one or more states have a defined value equal to or greater than zero, excluding one. In 
this embodiment, one or more states are biased, and one or more states are not. In a preferred 
embodiment, the bias is between 0.0 and 0.9 or greater than 1.1. 

Figure 1 represents one embodiment of the method of the present invention for 
determining the state probabilities of a single nucleotide within a nucleic acid sequence. The 
nucleotide for which the state probabilities are determined can be any nucleotide in the nucleic
acid sequence, is preferably a nucleotide at or near the middle of the sequence, and in a preferred
embodiment is the middle nucleotide in the nucleic acid sequence.
State probabilities for the nucleotide are determined by first finding the probability of
the initial oligonucleotide in the nucleic acid sequence, and then finding the transition 
probabilities for the remainder of the nucleotides in the nucleic acid sequence. The initial 
oligonucleotide probability and transition probability information is used to determine the 
probabilities of each of the states for the entire nucleic acid sequence, and the resulting state 
probabilities are assigned to the nucleotide. Eight states are described below for Figure 1, but 
those of skill in the art will readily see that fewer than eight states can be employed. 

Referring now to Figure 1, in step 12, the probability that the initial oligonucleotide 
occurs in each of the states is determined according to equation I: 



$$P_f(a_1 \ldots a_k) = \frac{|a_1 \ldots a_k|_f}{|N_f|} \qquad \text{(I)}$$

where "$a_1 \ldots a_k$" is an initial oligonucleotide of length k, $a_1$ is the first nucleotide in the
oligonucleotide, $N_f$ is the set of all oligonucleotides occurring in the model sample set, and f is an
element of the set of states, which, in a preferred embodiment, is {1+,2+,3+,N+,1-,2-,3-,N-}.

The oligonucleotide length is predefined, and can be any length for which probabilities
can be reliably generated. Oligonucleotides can be, for example, from 2 to 100 nucleotides, 
preferably 5 to 20 nucleotides, and more preferably from 8 to 12 nucleotides in length. The 
initial oligonucleotide frequencies of all possible oligonucleotides in the model sample set can 
be, for example, stored in a lookup table, which is accessed as needed. A table defining the
model sample set can be constructed, for example, by reference to sample nucleic acid sequences 
from a previously examined collection of nucleic acids, preferably from a closely related 
organism, more preferably from the same organism as the nucleic acid sequence under 
investigation. For example, sample nucleic acid sequences from Arabidopsis can be used for a 
table for investigation of nucleic acid sequences of plants such as soybean, maize, etc. Similarly, 
sample nucleic acid sequences from a chimpanzee can be used for a table for investigation of 
nucleic acid sequences of humans. By examining known nucleic acid sequences, model 
oligonucleotide frequencies in each of the states can be determined. A table can include 
indefinite or modified nucleotides, or any other nucleotide variations that occur in nucleic acid 
sequences. Alternatively, it is also possible to use estimation functions in place of such a table of 
probabilities (see, for example, Besemer, J., Borodovsky, M. (1999) Nucl. Acids Res., v.27, pp.
3911-3920, which is herein incorporated by reference in its entirety).
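
A lookup table of the kind described above could be built along the following lines. This is a minimal sketch under simplifying assumptions: the sample sequences are assumed to be already grouped by state, frequencies are plain relative counts, and the position of an oligonucleotide within a codon is ignored. The function name `build_initial_table` and the layout of `samples_by_state` are hypothetical.

```python
from collections import Counter

def build_initial_table(samples_by_state, k):
    """samples_by_state: dict mapping a state name to a list of sample
    sequences assumed to be in that state.
    Returns {state: {oligonucleotide: relative frequency}}."""
    table = {}
    for state, sequences in samples_by_state.items():
        counts = Counter()
        for seq in sequences:
            # Count every oligonucleotide of length k in the sample.
            for i in range(len(seq) - k + 1):
                counts[seq[i:i + k]] += 1
        total = sum(counts.values())
        table[state] = {o: n / total for o, n in counts.items()} if total else {}
    return table
```

A table keyed by state and oligonucleotide keeps each lookup to a single dictionary access once the sample set has been processed.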

In step 14, the transition probabilities for all nucleotides in the nucleic acid sequence after 
the initial oligonucleotide in each of the states are determined. The transition probability is the 
probability of a nucleotide occurring given the oligonucleotide immediately preceding the
nucleotide. The transition probability for the first nucleotide transition is set out in equation II: 



$$P_f(a_{k+1} \mid a_1 \ldots a_k) = \frac{|a_1 \ldots a_{k+1}|_f}{|a_1 \ldots a_k|_f} \qquad \text{(II)}$$

where k is the oligonucleotide length, $a_1$ is the first nucleotide in the oligonucleotide,
"$a_1 \ldots a_k$" is the initial oligonucleotide, $a_{k+1}$ is the nucleotide immediately following $a_k$, and $f \in$
{1+,2+,3+,N+,1-,2-,3-,N-}. Equation II determines the transition probability for the first
nucleotide following the initial oligonucleotide. After determining the transition probability for 
the first nucleotide after the initial oligonucleotide, the transition probabilities are determined
sequentially for the remaining nucleotides in the nucleic acid sequence. This means that a
transition probability is determined for the second nucleotide after the initial oligonucleotide
($a_{k+2}$) based on the oligonucleotide beginning at the second position, $a_2$, and ending at $a_{k+1}$. The
process is repeated until the end of the nucleic acid sequence is reached. For example, if the
oligonucleotide length is ten, then a transition probability for nucleotide eleven is determined
based on the oligonucleotide comprising nucleotides one through ten. Then, a transition
probability for nucleotide twelve is determined based on the oligonucleotide comprising
nucleotides two through eleven, and so on, until the last nucleotide in the nucleic acid sequence
is reached.
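
The sequential determination of transition probabilities can be sketched as a sliding lookup. This is an illustration only, assuming that each transition probability is estimated as a ratio of counts from sample sequences for a single state: the count of the oligonucleotide extended by the next nucleotide, divided by the count of the oligonucleotide alone. The helper names `count_oligos` and `transition_probabilities` are hypothetical.

```python
from collections import Counter

def count_oligos(sequences, length):
    """Count every oligonucleotide of the given length in the sample sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - length + 1):
            counts[seq[i:i + length]] += 1
    return counts

def transition_probabilities(sequence, k, counts_k, counts_k1):
    """Return (1-based position, probability) for each nucleotide that
    follows the initial oligonucleotide of length k."""
    result = []
    for i in range(k, len(sequence)):
        context = sequence[i - k:i]        # the k nucleotides preceding position i
        extended = sequence[i - k:i + 1]   # context plus the nucleotide itself
        denom = counts_k[context]
        prob = counts_k1[extended] / denom if denom else 0.0
        result.append((i + 1, prob))
    return result
```

In practice one pair of count tables would be kept per state, and unseen contexts would need some form of smoothing rather than a probability of zero.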

The transition probabilities can be stored in a table, for example. The table can be 
constructed, for example, by reference to sample nucleic acid sequences from a previously 
examined portion of nucleic acid, preferably from a closely related organism, more preferably 
from the same organism as the nucleic acid under investigation. By examining known nucleic 
acid sequences, model transition probabilities in each of the states can be determined.

In step 16, the probability of the nucleic acid sequence, (S), occurring in each of the states 
(f) is determined by finding the product of the probability of the initial oligonucleotide and the 
transition probabilities in each of the states. This step is set forth in equation III for a model 
with eight states: 

$$P_f(S) = P_f(a_1 \ldots a_k) \cdot \prod_{i=0}^{o-k-1} P_{F(i)}(a_{k+i+1} \mid a_{i+1} \ldots a_{i+k}) \qquad \text{(III)}$$

where the function

$$F(i) = \begin{cases} i \bmod 3 + 1 & \text{if } f = 1\pm \\ (i+1) \bmod 3 + 1 & \text{if } f = 2\pm \\ (i+2) \bmod 3 + 1 & \text{if } f = 3\pm \\ N & \text{if } f = N\pm \end{cases}$$

o is the length of the nucleic acid sequence, and "$a_1 \ldots a_k$" is the initial oligonucleotide.
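
The product in step 16 can be sketched as follows. This is a minimal sketch, assuming the product index i starts at 0 for the first nucleotide after the initial oligonucleotide, and assuming the caller supplies the two probability lookups. The names `frame`, `sequence_probability`, `initial_prob`, and `transition_prob` are hypothetical interfaces, not part of the described method.

```python
def frame(i, f):
    """Frame function F(i) for a state f such as "1+", "2-", or "N+"."""
    if f[0] == "N":
        return "N"
    offset = int(f[0]) - 1          # 0, 1, 2 for states 1, 2, 3
    return (i + offset) % 3 + 1

def sequence_probability(seq, k, f, initial_prob, transition_prob):
    """Probability of seq in state f.
    initial_prob(f, oligo) and transition_prob(frame, nucleotide, context)
    are caller-supplied lookups (assumed interfaces)."""
    prob = initial_prob(f, seq[:k])
    for i in range(len(seq) - k):
        context = seq[i:i + k]      # the k nucleotides preceding the next one
        nxt = seq[i + k]            # nucleotide whose transition is scored
        prob *= transition_prob(frame(i, f), nxt, context)
    return prob
```

For windows of realistic length, summing logarithms of the factors instead of multiplying them avoids floating-point underflow.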

In step 18, the probability of each state for the nucleic acid sequence, "P(f|S)", is
determined given the probability of the nucleic acid sequence, S, in each state. A bias function
is incorporated into the equation to account for known nucleic acid sequence information.
This step is set forth in equation IV:

$$P(f \mid S) = \frac{P_f \, P_f(S)}{\sum_{i \in \{1+,\,2+,\,3+,\,N+,\,1-,\,2-,\,3-,\,N-\}} P_i \, P_i(S)} \qquad \text{(IV)}$$

wherein $P_f$ takes a default value for each coding state (1+, 2+, 3+, 1-, 2-, 3-) and a default value for each noncoding
state (N+, N-). The bias function is used to modify these default $P_f$ values. By modifying the
default values, the investigator can account for known nucleic acid sequence features. For 
example, if another bioinformatics process has indicated that there is a high probability that a 
certain portion of a nucleic acid sequence comprises a gene, then it would be advantageous to 
bias the state probabilities in favor of the coding states. The resulting state probabilities 
produced by the method will reflect the bias through stronger probabilities of the coding states 
relative to the noncoding states. 

If, for example, the nucleic acid sequence is known to be a coding nucleic acid sequence, 
the bias function can be defined by equation V: 



$$\text{bias}(f) = \begin{cases} 1 & \text{if } f \neq N\pm \\ 0 & \text{if } f = N\pm \end{cases} \qquad \text{(V)}$$



Equation V uses a bias of 1 for all coding states, and a bias of 0 for all noncoding states. 
The net effect will be to cause the probability of the sequence in each noncoding state to drop to 
zero, while leaving the probability of the sequence in the coding states unaffected. Application
of equation IV then leads to a decrease of the probabilities of the noncoding states to zero, while 
increasing the probabilities of the coding states. 
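
The effect of a binary coding bias on the state probabilities can be sketched numerically. This is an illustration under stated assumptions: the bias is applied as a multiplier on each state's default value before normalization, and the uniform default of 1/8 per state in the usage example is a placeholder chosen for the example, not a value taken from the text. The names `state_posteriors` and `coding_bias` are hypothetical.

```python
STATES = ("1+", "2+", "3+", "N+", "1-", "2-", "3-", "N-")

def state_posteriors(seq_probs, defaults, bias):
    """seq_probs[f]: probability of the sequence in state f.
    defaults[f]: default weight of state f (placeholder values).
    bias[f]: bias value for state f (assumed to act as a multiplier)."""
    weighted = {f: bias[f] * defaults[f] * seq_probs[f] for f in STATES}
    total = sum(weighted.values())
    if total == 0:
        raise ValueError("all states have zero weight")
    # Normalize so the state probabilities sum to one.
    return {f: w / total for f, w in weighted.items()}

# Binary bias for a sequence known to be coding: 1 for coding, 0 for noncoding.
coding_bias = {f: (0.0 if f.startswith("N") else 1.0) for f in STATES}
```

With equal sequence probabilities and equal defaults, applying `coding_bias` drives the two noncoding states to zero and raises each coding state from 1/8 to 1/6, which matches the qualitative behavior described above.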

If the nucleic acid sequence is known to be a noncoding nucleic acid sequence, then the 
bias function can be defined by equation VI: 



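$$\text{bias}(f) = \begin{cases} 0 & \text{if } f \neq N\pm \\ 1 & \text{if } f = N\pm \end{cases} \qquad \text{(VI)}$$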



Equation VI reverses the effect of equation V. Of course, the bias function does not need 
to be binary in nature, as is shown in the above two examples, but rather can be defined in any
manner that corresponds with known nucleic acid sequence data. A principal feature of this 
technique is that it can be used to specifically combine gene prediction information from other 
sources into biasing the results of the state probabilities algorithm shown in Figure 1 (and 
subsequent gene prediction based thereon). 

The resulting values for the probability of each state for the nucleic acid sequence can
now be associated with the nucleotide for which state probabilities were being determined.

In a further embodiment of the method shown in Figure 1, the nucleic acid sequence is 
part of a larger nucleic acid sequence. This embodiment can be applied to any of the methods 
described herein wherein a nucleic acid sequence is used, including those represented in Figures 
1 through 7.

Figure 1 shows the determination of state probabilities for a single nucleotide in a nucleic 
acid sequence. Oftentimes, however, it will be desirable to determine the state probabilities for 
more than one nucleotide in a nucleic acid sequence. 

Figure 2 represents the application of the method shown in Figure 1 to multiple 
nucleotides in a nucleic acid sequence. In order to determine the state probabilities for more than
one nucleotide, a window is used for each nucleotide that is examined. The nucleotide that is 
being examined is within the window, and the probability determinations set out in equations I, 
II, III, and IV are performed for the sequence in the window. The oligonucleotide probabilities 
are determined as before for the nucleic acid sequence within the window, probabilities for each
of the states are determined for the nucleic acid sequence within the window, and those 
probabilities are assigned to the nucleotide within the window for which state probabilities are 
being determined, which, in a preferred embodiment, is the middle nucleotide. Another 
nucleotide is then examined, with the window shifted or redefined around the new nucleotide, 
and so on, until the final nucleotide in the nucleic acid sequence for which state probabilities are 
to be determined is reached. 

In steps 22, 24, 26, and 28, probabilities are determined as in steps 12, 14, 16, and 18 
respectively, with the window in steps 22, 24, 26, and 28 corresponding to the nucleic acid 
sequence in steps 12, 14, 16, and 18 respectively for the purposes of those steps. At step 28, the 
state probabilities for the nucleotide for which state probabilities are being determined are 
associated with that nucleotide. 

In step 30, the algorithm checks to see if the state probabilities for the last nucleotide 
have just been determined. If yes, flow proceeds to step 32 and ends. If in step 30 the last 
nucleotide has not been reached, flow proceeds to step 34, where the next nucleotide for which 
state probabilities are to be determined is designated as the nucleotide to analyze in steps 22, 24, 
26, and 28. After step 34, flow returns to steps 22, 24, 26, and 28, where the state probabilities 
of the designated nucleotide are determined. At step 34 any nucleotide from the remaining 
nucleotides that have not yet had state probabilities determined can be designated the next 
nucleotide. 

In a preferred embodiment, the first nucleotide to be examined in step 22 is the first 
nucleotide in a contiguous nucleic acid sequence of nucleotides for which state probabilities are 
to be determined, each subsequent nucleotide at step 34 is the next nucleotide of the contiguous 
nucleic acid sequence of nucleotides for which state probabilities are to be determined, and the 
last nucleotide in step 30 is the last nucleotide in the contiguous nucleic acid sequence of 
nucleotides for which state probabilities are to be determined. 

The window size can be the same or different for each nucleotide, and the nucleotide can 
be located anywhere within its window. In a preferred embodiment, the window size is the same 
for each nucleotide in the nucleic acid sequence, and each nucleotide is the middle nucleotide in 
its own window. In one embodiment, windows are from 3 nucleotides to 1,000 nucleotides in
length, preferably 50 to 200 nucleotides in length, and more preferably from 75 to 125
nucleotides in length. 
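
By way of non-limiting illustration, the windowed procedure of Figure 2 can be sketched in Python as follows. The names `window_bounds`, `per_nucleotide_state_probs`, and `toy_model` are hypothetical and do not appear in the specification. The `toy_model` function is only a stand-in for the probability model of equations I through IV, and clipping the window at the ends of the sequence is a simplifying assumption.

```python
from typing import Callable, Dict, List, Tuple

# The eight states: three coding frames plus noncoding, on each strand.
STATES = ("1+", "2+", "3+", "N+", "1-", "2-", "3-", "N-")


def window_bounds(i: int, n: int, w: int) -> Tuple[int, int]:
    """Return [start, end) of a length-w window around index i, clipped to 0..n."""
    lo = i - w // 2
    return max(0, lo), min(n, lo + w)


def per_nucleotide_state_probs(
    seq: str,
    model: Callable[[str], Dict[str, float]],
    w: int = 100,
) -> List[Dict[str, float]]:
    """Assign to each nucleotide the state probabilities of its own window."""
    n = len(seq)
    result = []
    for i in range(n):  # steps 30 and 34: advance to the next nucleotide
        start, end = window_bounds(i, n, w)
        result.append(model(seq[start:end]))  # steps 22 through 28
    return result


def toy_model(window: str) -> Dict[str, float]:
    """Placeholder model (an assumption): GC fraction stands in for coding."""
    gc = sum(base in "GC" for base in window) / len(window)
    probs = {state: 0.0 for state in STATES}
    probs["1+"] = gc
    probs["N+"] = 1.0 - gc
    return probs
```

In use, `per_nucleotide_state_probs(sequence, model, w=100)` returns one dictionary of state probabilities per nucleotide, in sequence order.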

The result of the process shown in Figure 2 is the association of state probabilities with 
each individual nucleotide for which state probabilities were determined. In one embodiment, 
the nucleotides for which state probabilities are to be determined are a contiguous nucleic acid
sequence of nucleotides within a longer nucleic acid sequence of nucleotides. 

Figures 3 through 7 all utilize probability models to determine state probabilities. Any 
probability model that allows for determination of the required probabilities in a plurality of 
states can be used, with use of an inhomogeneous Markov model preferred, and use of the 
inhomogeneous Markov model described above in reference to Figure 2 especially preferred.

Figure 3 represents one embodiment of a method for determining the coding strand of a 
nucleic acid sequence. The process determines the state probabilities for each nucleotide in the 
nucleic acid sequence, sums the positive states for the nucleic acid sequence, and sums the 
negative states for the nucleic acid sequence. If the sums for the positive states and the negative 
states are sufficiently different, then the process determines that the state with the greater sum is
the coding strand. 

In step 38, state probabilities are determined for each nucleotide in the nucleic acid 
sequence for which the coding strand is being determined. In one embodiment, state 
probabilities are determined using the inhomogeneous Markov model described above in
reference to Figure 2.

In step 40, the probabilities determined in step 38 for the positive states (1+,
2+, 3+, and N+) for each nucleotide in the nucleic acid sequence for which the coding strand is
being determined are summed. That is, the values for the states of noncoding, positive and
coding, positive in the first, second, and third reading frames for all nucleotides in the nucleic
acid sequence for which the coding strand is being determined are summed. The sum is set to
the arbitrary variable X.

In step 42, the values determined in step 38 for the negative states (1-, 2-, 3-, N-) for each 
nucleotide in the nucleic acid sequence for which the coding strand is being determined are 
summed. That is, the values for the states of noncoding, negative and coding, negative in the 
first, second, and third reading frames for all nucleotides in the nucleic acid sequence for which
the coding strand is being determined are summed. The sum is set to the arbitrary variable Y.
Steps 40 and 42 can be performed in reverse order. 

In step 44, a function of X and Y is used to determine whether the state probabilities 
indicate sufficient coding on one strand of the nucleic acid sequence. That is, it is determined 
whether f(X,Y)<T, where T is a defined threshold value. Any function can be used that allows 
for the desired discrimination. In one embodiment, the function used in step 44 is 

f(Xt n = PjM . ^ nX , Y) = , the value of T is about 0.1 to about 0.9, 

preferably is about 0.25 to about 0.75, and even more preferably is about 0.4 to about 0.6. If in 
step 44 the function results in a value that is less than the threshold value, T, then flow proceeds
to step 46, where it is determined that coding is mixed or is not detectable. If in step 44 the 
function results in a value that is equal to or greater than the threshold value, T, then flow 
proceeds to step 48. 

In step 48, it is determined on which strand coding occurs. A function of X is compared 
to a function of Y to determine which strand is coding. Any two functions that allow for the 
proper comparison can be used, including functions that weight one of the two strands. In one 
embodiment, f(X) = X and f(Y) = Y , and the comparison in step 48 simply determines which 
sum is greater. If in step 48 the function of X is found to be greater than the function of Y, then 
flow proceeds to step 50 where it is determined that coding is on the positive strand. If in step 48 
it is determined that the function of X is not greater than the function of Y, then flow proceeds to step 52, where
it is determined that coding is on the negative strand. 

In another embodiment of the method represented by Figure 3, steps 44 and 46 can be 
removed for situations in which it is already known or suspected that coding is present and only 
on one strand. In this embodiment, flow begins at step 38 and, after executing step 42, flow 
proceeds directly from step 42 to step 48. 
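
The strand determination of Figure 3 can be sketched, purely as an illustration, as follows. The function name `coding_strand` is hypothetical. The discriminating function used here, |X - Y| / (X + Y), is an assumption chosen only because the description permits any function that allows the desired discrimination. The comparison in step 48 uses f(X) = X and f(Y) = Y.

```python
from typing import Dict, List, Optional

POSITIVE = ("1+", "2+", "3+", "N+")
NEGATIVE = ("1-", "2-", "3-", "N-")


def coding_strand(
    probs: List[Dict[str, float]], threshold: float = 0.5
) -> Optional[str]:
    """Return "+" or "-" for the coding strand, or None if mixed or undetectable."""
    x = sum(p[s] for p in probs for s in POSITIVE)  # step 40: sum X
    y = sum(p[s] for p in probs for s in NEGATIVE)  # step 42: sum Y
    total = x + y
    # Step 44 with an assumed f(X, Y) = |X - Y| / (X + Y), compared to T.
    if total == 0 or abs(x - y) / total < threshold:
        return None  # step 46: coding is mixed or not detectable
    # Step 48 with f(X) = X and f(Y) = Y; steps 50 and 52 report the strand.
    return "+" if x > y else "-"
```

Setting `threshold` to 0 approximates the alternative embodiment in which steps 44 and 46 are removed, except that an input with no probability mass at all still returns `None`.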

Figure 4 represents one embodiment of a method for determining the extent of an open 
reading frame (ORF) within a nucleic acid sequence. The process determines the extent of the 
open reading frame by first determining the state probabilities for each nucleotide in the nucleic 
acid sequence. Then, beginning from within the nucleic acid sequence, preferably the 
approximate middle of the nucleic acid sequence, and proceeding toward one end of the nucleic
acid sequence, the process examines each nucleotide in turn and determines whether the
nucleotide is sufficiently likely to code. When a sufficient number of nucleotides with an 
insufficient likelihood of coding are encountered, the process determines that one end of the open 
reading frame has been found. The process then repeats from the middle to the other end of the 
nucleic acid sequence in order to find the second end of the open reading frame. 

In step 56, the state probabilities of each of the nucleotides in the nucleic acid sequence 
are determined. As stated above, any probability model that has the correct form of output can 
be used, with an inhomogeneous Markov model preferred, and the inhomogeneous Markov 
model described above and represented in Figure 2 most preferred. 

In step 58, the coding strand of the nucleic acid sequence is determined and designated 
"S." Any algorithm or method that can use the state probabilities produced in step 56 can be 
used, and in a preferred embodiment, the method described above and represented in Figure 3 is 
used. If the coding strand is indeterminate, an error can be returned at this step and processing does
not continue. In applications where the coding strand is already known or suspected, step 58 can 
be omitted from the process, in which case step 56 can flow directly to step 60. 

In step 60 an arbitrary variable, L, is set to half of the length of the nucleic acid sequence, 
S, which designates L as the middle nucleotide (determination of the middle for even and odd
sequences is done as described above for the middle nucleotide). In an alternative embodiment, 
L can initially be set to any nucleotide in the nucleic acid sequence. It is preferred, however, to 
begin with L relatively close to the middle of the putative ORF, because proper resolution of the 
ends of the ORF is then more likely. 

Steps 62, 64, and 66 effectively search through the nucleic acid sequence in a descending 
direction from L toward the first nucleotide in the nucleic acid sequence for one of the ORF ends. 
In step 62, the sum of the probabilities of the coding states on the strand S - that is the set (1+, 
2+, and 3+) or the set (1-, 2-, and 3-) depending on whether strand S is the positive or negative 
strand - for nucleotide L is determined and compared to threshold value T. In an alternative 
embodiment, the probability of all six coding states (1+, 2+, 3+, 1-, 2-, and 3-) can be combined.
If the sum of the coding states is greater than or equal to a threshold value, T, and the nucleotide
is greater than the first nucleotide in the nucleic acid sequence (that is, L>1), then L is set to L-1
and P, an arbitrary counting variable, is set to L-1. In one embodiment, the value of T is about
0.1 to about 0.9, preferably is about 0.25 to about 0.75, and even more preferably is about 0.4 to
about 0.6. 

Flow then proceeds to step 64. If the sum of the coding states, as discussed above, is less 
than T and P is greater than 1, then P is set to P-1. The effect of the two steps, 62 and 64, is to
reduce both L and P at the same rate if the sum of the coding states is greater than or equal to T,
or to reduce P but not L if the sum of the states is less than T. 

After step 64, flow proceeds to step 66, where it is determined if L-P>T" or P=1. If
L-P>T", wherein T" is a threshold value, then a gap between the last nucleotide (L) with a sufficient
sum of coding states and the current nucleotide being examined has increased beyond the
threshold value T". T" can be set to any number that allows for the proper gap of noncoding
nucleotides. T" should be larger than the maximum expected length of an intron for the nucleic
acid sequence. This number will depend in large part on the model sample set being used. If the
number for T" is set too low, then a relatively lengthy intron will be sufficient to fix L at the end
of an exon that is not at the end of the ORF. If P=1, then the end of the sequence has been
reached. In one embodiment, T" is about 10 to about 20,000 nucleotides, preferably about 50 to
about 10,000 nucleotides, and more preferably about 500 to about 700 nucleotides.

If neither condition in step 66 is met, then flow returns to step 62 and loops through steps 
64 and 66 until one of the conditions in step 66 is met, at which point flow proceeds to step 68. 
Steps 68, 70, 72, and 74 check for the end of the ORF in the ascending direction, and perform the 
same function as steps 60, 62, 64, and 66 but in the opposite direction.

In step 68, M is set to the middle nucleotide. As above for L, this value can be altered in 
alternative embodiments. In step 70, the sum of the coding states, as above, is compared to T, 
and M is compared to the length of the nucleic acid sequence. If the sum of the coding states of 
nucleotide M is greater than or equal to T and M is less than the length of the nucleic acid 
sequence, then M is set to M+1 and Q is set to M+1. Flow proceeds to step 72, where, if the sum
of the coding states is less than T and Q is less than the length of the nucleic acid sequence, then
Q is set to Q+1. Flow proceeds to step 74, where it is determined if Q-M>T", or Q> length of the
nucleic acid sequence. If either is true, then flow proceeds to step 76, where the ORF is
determined to extend from nucleotide L to nucleotide M. If in step 74 neither condition is true,
then flow loops to step 70.

In an alternative embodiment, different threshold values can be used in place of T and T"
for the second loop, which comprises steps 70, 72, and 74. Different threshold values for steps 
62, 64, and 66 versus steps 70, 72, and 74 could be desirable if, for example, one end of an ORF 
was known or suspected to be degraded to some extent. 
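
The two scans of Figure 4 can be sketched, as an illustration only, as follows. The sketch reads the flowchart as follows: a cursor (P or Q) advances one nucleotide at a time, and L or M records the last nucleotide whose summed coding probability reached T. The names `scan_for_end` and `orf_extent` are hypothetical, the input `coding` is assumed to hold the per-nucleotide sum of the coding-state probabilities on strand S, and indices are zero-based.

```python
from typing import List, Tuple


def scan_for_end(
    coding: List[float], start: int, step: int, threshold: float, max_gap: int
) -> int:
    """Walk from start in direction step (+1 or -1).

    Returns the last index whose coding sum is >= threshold, stopping when
    the run of insufficient nucleotides exceeds max_gap (the T" test) or
    when the end of the sequence is reached.
    """
    last_coding = start  # plays the role of L (or M)
    pos = start          # plays the role of P (or Q)
    while 0 <= pos < len(coding):
        if coding[pos] >= threshold:
            last_coding = pos
        elif abs(pos - last_coding) > max_gap:
            break
        pos += step
    return last_coding


def orf_extent(
    coding: List[float], threshold: float = 0.5, max_gap: int = 600
) -> Tuple[int, int]:
    """Return (first, last) indices of the putative ORF, scanning from the middle."""
    mid = len(coding) // 2
    left = scan_for_end(coding, mid, -1, threshold, max_gap)   # steps 62 to 66
    right = scan_for_end(coding, mid, +1, threshold, max_gap)  # steps 70 to 74
    return left, right
```

Passing different `threshold` and `max_gap` values to each call of `scan_for_end` corresponds to the alternative embodiment in which the two loops use different threshold values.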

Figure 5 is a flowchart representing one embodiment of a method for determining the 
location of deletions and additions within a nucleic acid sequence. The process first determines 
the state probabilities for each nucleotide in the nucleic acid sequence. Then the process 
determines whether in the window around a specific nucleotide the most likely state for the 
nucleic acid sequence on one side of the specific nucleotide is different from the most likely state 
for the nucleic acid sequence on the other side of the specific nucleotide. If so, the process 
determines whether a hypothetical insertion or deletion at the specific nucleotide would 
sufficiently improve the state probabilities of the entire nucleic acid sequence in the window. If 
so, then an insertion or a deletion is indicated. 

In step 78, the state probabilities of each of the nucleotides in the nucleic acid sequence are
determined. As stated above, any probability model that has the correct form of output can be 
used, with an inhomogeneous Markov model preferred, and the inhomogeneous Markov model 
described above and represented in Figure 2 most preferred. 

In step 80, the first nucleotide is designated as "Z," and the size of a window, W, is set.
In step 82, the probabilities of each of the states of the nucleotides between Z and the midpoint of
the window, Z+W/2, are averaged, and the state with the greatest average is set to "A" (windows
with an even or odd number of nucleotides are treated as above for the middle nucleotide with
respect to determination of W/2). "A" is effectively the most likely state of the first half of
window W.

In step 84, the probabilities of the states of the nucleotides between the midpoint of the
window, Z+W/2, and the end of the window, Z+W, are averaged, and the state with the greatest
average is set to B. B is effectively the most likely state of the second half of window W.

In step 86, the most probable states, A and B, are checked to see if they are each a coding
state and not the same coding state. If both A and B are coding states and they are not the same
coding state, then flow proceeds to steps 88, 90, and 92, where the nucleotide at Z+W/2 is
examined further. If, in step 86, A and B are the same coding state, or if one of the two is most
probably a noncoding state, then flow proceeds to step 96, where it is determined if Z is greater than
the length of the nucleic acid sequence minus W/2. If so, then flow proceeds to step 98, and the
process ends. If, in step 96, Z is not within a distance of W/2 of the end of the nucleic acid
sequence, then flow proceeds to step 100, where Z is increased by one. Flow then loops to step
82.
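
The half-window comparison of steps 82, 84, and 86 can be sketched, by way of illustration, as follows. The names `most_likely_state` and `frameshift_candidates` are hypothetical. As a simplifying assumption, only windows that fit entirely within the sequence are examined and indices are zero-based. The sketch reports candidate positions only and does not perform the hypothetical-average tests of steps 88 through 94.

```python
from typing import Dict, List

CODING = {"1+", "2+", "3+", "1-", "2-", "3-"}


def most_likely_state(probs: List[Dict[str, float]]) -> str:
    """Average each state's probability over probs and return the top state."""
    states = probs[0].keys()
    return max(states, key=lambda s: sum(p[s] for p in probs) / len(probs))


def frameshift_candidates(probs: List[Dict[str, float]], w: int) -> List[int]:
    """Return midpoints Z+W/2 where the two half-windows favor different coding states."""
    half = w // 2
    hits = []
    for z in range(0, len(probs) - w + 1):  # step 100: Z advances by one
        a = most_likely_state(probs[z:z + half])      # step 82: state A
        b = most_likely_state(probs[z + half:z + w])  # step 84: state B
        if a in CODING and b in CODING and a != b:    # step 86
            hits.append(z + half)
    return hits
```

Each returned index marks a nucleotide that would then be examined further for a possible insertion or deletion.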

If in step 86 it was determined that both conditions were met, then flow proceeds to
steps 88 through 92 to determine if either a deletion or an addition occurred at nucleotide Z+W/2.

In step 88, a hypothetical average of state probabilities for state A for the entire window,
nucleotides Z to Z+W, for an insertion is determined. The hypothetical average of state
probabilities for state A is determined for the window as if the nucleotide at Z+W/2 is removed.
The probabilities of state A of the nucleotides in W are averaged to obtain the hypothetical
average state probabilities for state A for the entire window, and the value is set to N. In step 90,
a hypothetical average of state probabilities for state A for the entire window, nucleotides Z to
Z+W, for a deletion is calculated similarly. The hypothetical average of state probabilities for
state A in step 90 is determined and set to M for the window as if a nucleotide has been added on
one side or the other of the nucleotide at Z+W/2. By averaging the state probabilities of all of the
nucleotides in the window for either an insertion or a deletion, the values of N and M reflect the
likelihood that either an insertion or a deletion has taken place. In steps 88 and 90, in an
alternative embodiment, state B can be used in place of state A to achieve a similar result.

In step 92, the larger of M and N is compared to the sum of the probabilities of the states
indicating coding (1+, 2+, 3+, 1-, 2-, and 3-) of the nucleotide at Z+W/2. If in step 92 neither M
nor N is greater than the sum of the probabilities of the coding states of the nucleotide at Z+W/2,
then it is determined that no insertion or deletion has taken place and flow proceeds to step 96. If
in step 92 either M or N is greater than the sum of the probabilities of the coding states of the
nucleotide at Z+W/2, then it is determined that an insertion or a deletion has taken place, and flow
proceeds to step 94.

In step 94, a deletion is indicated if N is greater than M, and an insertion is indicated if N
is not greater than M, and flow then proceeds to step 96.

Figure 6 is a flow chart representing one embodiment of a method for determining the 
location of one or more exons within a nucleic acid sequence and the protein translation of those 
exons. The process begins by determining the state probabilities for each nucleotide in the 
nucleic acid sequence, the coding strand, and the extent of the open reading frame. The process
then classifies each nucleotide according to its most probable state. Filters, which reclassify
nucleotides in a defined manner in order to make local blocks of the nucleic acid sequence
consistent, are then applied to the nucleic acid sequence. Regions of the nucleic acid sequence
that are in any of classes 1, 2, or 3 are then designated as exons, and the exons are translated.
Translation is accomplished by using the universal genetic code to convert the nucleic acid
sequence of the designated exons into the corresponding amino acid sequence based on the
reading frame of the class. That is, exons in class 1 will be translated in reading frame 1, exons
in class 2 will be translated in reading frame 2, and exons in class 3 will be translated in
reading frame 3. The translation is linearly arranged to correspond to the linear arrangement of
the exons along the nucleic acid sequence.

In step 102, the state probabilities of each of the nucleotides in the nucleic acid sequence
are determined. As stated above, any probability model that has the correct form of output can be
used, with an inhomogeneous Markov model preferred, and the inhomogeneous Markov model
described above and represented in Figure 2 most preferred. In step 104, the strand and the
extent of the open reading frame are determined. Any method for determining the strand and the
extent of the ORF that can use the state probabilities generated in step 102 can be used, and in a
preferred embodiment, the methods described above and represented in Figures 3 and 4 can be
used for such determination.

In step 106, the nucleotides in the nucleic acid sequence are categorized as the highest
probability state as determined in step 102. For example, in a model having four states for each 
nucleic acid strand, each nucleotide is categorized as 1, 2, 3, or N. 

In step 108, which is optional, one or more filters are applied to the nucleic acid sequence 
in order to group adjacent nucleotides by class. Any filter that converts portions of the nucleic
acid sequence with inconsistent nucleotide classification to a more homogeneous state can be
used. The net effect of the application of one or more filters to the nucleic acid sequence
classification in step 106 will be to group adjacent nucleotides and blocks of nucleotides into the
same coding classification, thereby making exons and introns more uniform, and exon and intron
boundaries more evident.

In step 110, the filtered nucleic acid sequence is analyzed for exons. Any contiguous
regions with coding classes of 1, 2, or 3 are determined to be exons. Once each exon has been 
identified, the exons can be translated using the universal genetic code, and a resulting protein 
sequence derived. 
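
Steps 106 and 110 can be sketched, for illustration, as follows. The names `classify` and `find_exons` are hypothetical. The sketch assumes that the coding strand is already known, so that each nucleotide carries probabilities for the four classes 1, 2, 3, and N. Translation is omitted from the sketch.

```python
from itertools import groupby
from typing import Dict, List, Tuple

CODING_CLASSES = {"1", "2", "3"}


def classify(probs: List[Dict[str, float]]) -> List[str]:
    """Step 106: label each nucleotide with its highest-probability class."""
    return [max(p, key=p.get) for p in probs]


def find_exons(classes: List[str]) -> List[Tuple[int, int]]:
    """Step 110: return (start, end_exclusive) for each contiguous coding run."""
    exons = []
    pos = 0
    for is_coding, run in groupby(classes, key=lambda c: c in CODING_CLASSES):
        length = len(list(run))
        if is_coding:
            exons.append((pos, pos + length))
        pos += length
    return exons
```

Each returned pair gives the zero-based bounds of one putative exon, which could then be passed to a translation routine.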

Figure 7 is a second embodiment of the method described above and represented in
Figure 6, with explicit filtering steps detailed therein. In Figure 7, steps 102, 104, 106, and 110
are the same as those described above and shown in Figure 6. In Figure 7, after step 106, steps
112, 114, 116, 118, 120, 122, and 124 are filter steps that are applied to the categorized nucleic
acid sequence produced in step 106. The order shown for the filter steps, 112, 114, 116, 118,
120, 122, and 124, can be rearranged to occur in any order in the process, and any combination
of the steps can be used, including combinations that omit one or more of the filtering steps.

In step 112, any noncoding nucleotide flanked by two nucleotides with the same class is
reclassified into the class of the two flanking nucleotides. For example, 1,N,1 would be 
converted to 1,1,1. 

In step 114, any nucleotide that is flanked by two pairs of adjacent nucleotides all with
the same class is reclassified into the class of the flanking nucleotides. For example, 1,1,2,1,1
would be converted to 1,1,1,1,1.

In step 116, any adjacent nucleotide pair having the same class that is flanked by two
pairs of adjacent nucleotides all with the same class is reclassified into the class of the flanking
nucleotides. For example, 1,1,2,2,1,1 would be converted to 1,1,1,1,1,1.

In step 118, any adjacent nucleotide pair having the same class that is flanked by two
nucleotides with the same class is reclassified into the class of the flanking nucleotides. For
example, 1,2,2,1 would be converted to 1,1,1,1.

In step 120, any nucleotide flanked by two nucleotides with the same class is reclassified
into the class of the flanking nucleotides. For example, 1,2,1 is converted to 1,1,1.
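
The pattern filters of steps 112 through 120 differ only in the length of the inner run and of the flanks, so they can be sketched with one helper, as illustrated below. The names `smooth` and `apply_pattern_filters` are hypothetical. Scanning left to right and updating the classification in place is an assumption, as the description does not fix a scan order.

```python
from typing import List, Sequence


def smooth(
    classes: Sequence[str], inner: int, flank: int, only_noncoding: bool = False
) -> List[str]:
    """Reclassify runs of `inner` like-classed nucleotides that are flanked on
    both sides by `flank` nucleotides all sharing one other class."""
    out = list(classes)
    for i in range(flank, len(out) - inner - flank + 1):
        left = out[i - flank:i]
        mid = out[i:i + inner]
        right = out[i + inner:i + inner + flank]
        if len(set(left) | set(right)) != 1 or len(set(mid)) != 1:
            continue  # flanks disagree, or the inner run is not uniform
        target = left[0]
        if mid[0] == target:
            continue  # already consistent
        if only_noncoding and mid[0] != "N":
            continue  # step 112 applies to noncoding nucleotides only
        out[i:i + inner] = [target] * inner
    return out


def apply_pattern_filters(classes: Sequence[str]) -> List[str]:
    """Apply steps 112, 114, 116, 118, and 120 in the order shown in Figure 7."""
    out = smooth(classes, inner=1, flank=1, only_noncoding=True)  # step 112
    out = smooth(out, inner=1, flank=2)                           # step 114
    out = smooth(out, inner=2, flank=2)                           # step 116
    out = smooth(out, inner=2, flank=1)                           # step 118
    out = smooth(out, inner=1, flank=1)                           # step 120
    return out
```

Because the steps can be rearranged or omitted, a caller could equally apply `smooth` with any subset of these settings in any order.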

In step 122, any contiguous, noncoding nucleotide region with an insufficient length is
reclassified into the class of the flanking coding regions. An insufficient length is any length that
is too small to be an intron. This length will be dependent in large part upon the particular
nucleic acid sequence under study. In one embodiment, a length of about 10 to 50, preferably
about 20 to 40, and more preferably about 25 to 35 nucleotides in length is used. The size of the
noncoding nucleotide length required can, in alternative embodiments, be changed as appropriate
to better suit examination of the nucleic acid sequence under study. In step 122, the
classification of the flanking regions of coding nucleotides can be extended into the noncoding
regions an equal amount on either side, an unequal amount on either side, or entirely on one side
or the other.

In step 124, any coding region (i.e. a region with nucleotides of classes 1, 2, or 3, 
comprising more than one nucleotide classification) is reclassified as the most common class in 
that coding segment. 
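
Steps 122 and 124 can be sketched, as an illustration only, as follows. The names `runs`, `fill_short_gaps`, and `unify_coding_regions` are hypothetical. Filling a short noncoding gap entirely from the class on its left side is one of the options the description allows and is chosen here only for brevity. The default of 30 nucleotides is an assumption taken from the more preferred range.

```python
from collections import Counter
from itertools import groupby
from typing import Callable, Iterator, List, Sequence, Tuple


def runs(
    classes: Sequence[str], key: Callable[[str], bool]
) -> Iterator[Tuple[bool, int, int]]:
    """Yield (key_value, start, end_exclusive) for each maximal run."""
    pos = 0
    for value, group in groupby(classes, key=key):
        length = len(list(group))
        yield value, pos, pos + length
        pos += length


def fill_short_gaps(classes: Sequence[str], min_intron: int = 30) -> List[str]:
    """Step 122: reclassify interior noncoding runs shorter than min_intron."""
    out = list(classes)
    for is_noncoding, start, end in runs(classes, key=lambda c: c == "N"):
        interior = start > 0 and end < len(out)  # coding on both sides
        if is_noncoding and interior and end - start < min_intron:
            out[start:end] = [out[start - 1]] * (end - start)
    return out


def unify_coding_regions(classes: Sequence[str]) -> List[str]:
    """Step 124: give each contiguous coding region its most common class."""
    out = list(classes)
    for is_coding, start, end in runs(classes, key=lambda c: c != "N"):
        if is_coding:
            winner = Counter(out[start:end]).most_common(1)[0][0]
            out[start:end] = [winner] * (end - start)
    return out
```

Applying `fill_short_gaps` before `unify_coding_regions` lets the second filter resolve any mixed classes created when a gap between differently classed regions is filled.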

Flow proceeds to step 110, where the filtered nucleic acid sequence is analyzed for exons.
Any contiguous regions with nucleotides of classes 1, 2, or 3 are determined to be exons. Once
each exon has been identified, the exons can be translated using the universal genetic code, and a
resulting protein sequence derived.

While performing the methods described above in Figures 1-7, windows can sometimes
extend past the end of a sequence. Conventional applications that use window-based probability
models for multiple nucleotides, such as the windows described above, are limited in their
application at the ends of nucleic acid sequences. Since coding probability can be calculated
using a window that is centered on each nucleotide of a nucleic acid sequence in turn, a window
can extend beyond an end of a sequence. Figure 8a schematically represents a nucleic acid
sequence 200 with a window 204 of length "W." As shown in Figure 8a, the window 204 is
empty for the first W/2 bases at an end 206 of the sequence 200.

As shown in Figure 8b, the present invention remedies this problem by using the local
nucleic acid sequence 216 at the end 206 of the nucleic acid sequence 200 as a source for
hypothetical nucleotides added on to the end 206 of the nucleic acid sequence 200. As shown in
Figure 8c, a copy 218 of the local nucleic acid sequence 216 can be created. As shown in Figure
8d, the copy 218 can then be appended onto the end 206 to form a hypothetical nucleic acid
sequence extension. As shown in Figure 8d, the window 204 is now filled with nucleotides from
the nucleic acid sequence 200 and the hypothetical nucleic acid sequence extension 218, which
allows for probability determination within the window 204. As shown in Figures 8b, 8c, and
8d, the same process can be performed on the other end of the sequence at the same time. Any
number of nucleotides can be copied and added in this manner in order to provide the correct size
window. In a preferred embodiment, the number of nucleotides copied is a multiple of three.
For example, if a 100 nucleotide window is desired for the first nucleotide in the nucleic acid
sequence, the first 51 nucleotides of the nucleic acid sequence can be copied to form a
hypothetical 51 nucleotide extension. When state probabilities are determined for the first
nucleotide, the 51 appended nucleotides are used to fill the first half of the window. The same or
different nucleotides can be copied and used in a similar manner for any other nucleotides
without a sufficient window. This process can be repeated for the other end of the nucleic acid
sequence, of course, as needed. The copied nucleotides can be appended in either orientation on
the end of the nucleic acid sequence.
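
The end extension of Figures 8b through 8d can be sketched, for illustration, as follows. The name `extend_ends` is hypothetical. Rounding half the window up to a multiple of three reproduces the 51 nucleotide example above. Copying in the original orientation and clamping the copy to the sequence length are assumptions.

```python
from typing import Tuple


def extend_ends(seq: str, w: int) -> Tuple[str, int]:
    """Append copies of the local end sequence to both ends of seq.

    Returns the extended sequence and the offset at which the original
    first nucleotide now sits.
    """
    pad = w // 2
    pad += (-pad) % 3         # round up to a multiple of three
    pad = min(pad, len(seq))  # cannot copy more than the sequence holds
    left = seq[:pad]                   # copy of the local sequence at the start
    right = seq[-pad:] if pad else ""  # copy of the local sequence at the end
    return left + seq + right, len(left)
```

A window centered on the original nucleotide at index i is then taken around index `offset + i` of the extended sequence, so the window is filled even for the first and last nucleotides.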

Implementation: 

A computer system capable of carrying out the functionality and methods described
above is shown in more detail in Figure 9a. A computer system 702 includes one or more
processors, such as a processor 704. The processor 704 is connected to a communication bus
706. The computer system 702 also includes a main memory 708, which is preferably random
access memory (RAM). Various software embodiments are described in terms of this exemplary
computer system. After reading this description, it will become apparent to a person skilled in
the relevant art how to implement the invention using other computer systems and/or computer
architectures.

In a further embodiment, shown in Figure 9b, the computer system can also include a
secondary memory 710. The secondary memory 710 can include, for example, a hard disk drive
712 and/or a removable storage drive 714, representing a floppy disk drive, a magnetic tape
drive, or an optical disk drive, among others. The removable storage drive 714 reads from and/or
writes to a removable storage unit 718 in a well known manner. The removable storage unit 718
represents, for example, a floppy disk, magnetic tape, or an optical disk, which is read by and
written to by the removable storage drive 714. As will be appreciated, the removable storage
unit 718 includes a computer usable storage medium having stored therein computer software
and/or data.

In alternative embodiments, the secondary memory 710 may include other similar means 
for allowing computer programs or other instructions to be loaded into the computer system. 
Such means can include, for example, a removable storage unit 722 and an interface 720.
Examples of such can include a program cartridge and cartridge interface (such as that found in
video game devices), a removable memory chip (such as an EPROM, or PROM) and associated 
socket, and other removable storage units 722 and interfaces 720 which allow software and data 
to be transferred from the removable storage unit 722 to the computer system. 

The computer system can also include a communications interface 724. The
communications interface 724 allows software and data to be transferred between the computer
system and external devices. Examples of the communications interface 724 can include a
modem, a network interface (such as an Ethernet card), a communications port, a PCMCIA slot
and card, etc. Software and data transferred via the communications interface 724 are in the form
of signals 726 that can be electronic, electromagnetic, optical or other signals capable of being
received by the communications interface 724. Signals 726 are provided to the communications
interface 724 via a channel 728. The channel 728 carries signals 726 in two directions and can be
implemented using wire or cable, fiber optics, a phone line, a cellular phone link, an RF link and
other communications channels. In one embodiment, the channel is a connection to a network.
The network can be any network known in the art, including, but not limited to, LANs, WANs,
and the Internet. Nucleic acid sequence data can be stored in remote systems, databases, or
distributed databases, among others, for example GenBank, and transferred to the computer system
for processing via the network. In a preferred embodiment, nucleic acid sequence data is
received through the Internet via the channel 728. Nucleic acid sequences can be input into the
system and stored in the main memory 708. Input devices include the communication and
storage devices described herein, as well as keyboards, voice input, and other devices for
transferring data to a computer system. In a further embodiment, nucleic acid sequences can be
generated by an automatic sequencer, for example any that are known in the art, and the
implementations described herein can be incorporated within the automatic sequencer device in
order to directly use the output of the automatic sequencer.

In this document, the terms "computer program medium" and "computer usable medium"
are used to generally refer to media such as the removable storage device 718, a hard disk 
installed in hard disk drive 712, and signals 726. These computer program products are means 
for providing software to the computer system. 

Computer programs (also called computer control logic) are stored in the main memory 
708 and/or the secondary memory 710. Computer programs can also be received via the 
communications interface 724. Such computer programs, when executed, enable the computer 
system to perform the features of the present invention as discussed herein. In particular, the 
computer programs, when executed, enable the processor 704 to perform the features of the 
present invention. Accordingly, such computer programs represent controllers of the computer 
system. 

In an embodiment where the invention is implemented using software, the software may 
be stored in a computer program product and loaded into the computer system using the 
removable storage drive 714, the hard drive 712 or the communications interface 724. The 
control logic (software), when executed by the processor 704, causes the processor 704 to 
perform the functions of the invention as described herein. 

In another embodiment, the invention is implemented primarily in hardware using, for 
example, hardware components such as application specific integrated circuits (ASICs). In one 
embodiment incorporating ASIC technology, a self-contained device, which could be hand-held,
has integrated circuits specifically designed to perform the methods described above without the
need for software. Implementation of such a hardware state machine so as to perform the functions
described herein will be apparent to persons skilled in the relevant art(s). In yet another
embodiment, the invention is implemented using a combination of both hardware and software. 

The following examples are illustrative only. It is not intended that the present invention 
be limited to the illustrative embodiments. 


EXAMPLE 1 

Referring now to Figures 10a, 10b, and 10c, examples of biasing are shown. Figure 10a
shows a portion of genomic DNA 300. Aligned with the genomic DNA 300 is an expressed
sequence tag (EST) 302. The EST 302 comprises coding regions 304 and noncoding regions
306. In Figure 10b a window 308 of nucleotides is examined. The window 308 is positioned on
the portion of the genomic DNA 300 that corresponds to a known coding region 304 on the EST 302.
The a priori probability of coding is said to be 100% over that window 308 and a bias is applied
accordingly. In Figure 10c, a different window 310 straddles the intron-exon boundary, and the
a priori probability of coding is said to be 100% for the nucleotides in the window 310 that
correspond to the coding region 304 of the EST 302, while the a priori probability of coding is
said to be 0% for the nucleotides in the window 310 that correspond to the noncoding region 306
of the EST 302.

Bias is applied to the two different situations shown in Figures 10b and 10c as follows.
The general equation for the probability of the sequence $S = a_1 \ldots a_w$ of a Markov process of order
$n$ is shown in Equation VII:

$$P(a_1 \ldots a_w) = P(a_1 \ldots a_n) \cdot P(a_{n+1} \mid a_1 \ldots a_n) \cdots P(a_w \mid a_{w-n} \ldots a_{w-1}) \qquad \text{(VII)}$$

This equation is based on an inhomogeneous Markov model, whereby the initial and
transitional probabilities are dependent on the periodic state of the sequence (as in a hidden
Markov model with fixed state transition probabilities). In this model, initial and transition
probabilities are dependent on the sequence orientation and phase in which the sequence is read
relative to the codons in the coding portion of the nucleic acid sequence. Thus, equation VIII is
used:

$$P_\sigma(S) = P_\sigma(a_1 \ldots a_n) \cdot \prod_{i=1}^{w-n} P_{F(i,\sigma)}(a_{n+i} \mid a_i \ldots a_{n+i-1}) \qquad \text{(VIII)}$$

where $\sigma \in \{1+, 2+, 3+, N+, 1-, 2-, 3-, N-\}$ represents the possible states
for reading the sequence, and

$$F(i,\sigma) = \begin{cases} i \bmod 3 + 1 & \text{if } \sigma = 1\pm \\ (i+1) \bmod 3 + 1 & \text{if } \sigma = 2\pm \\ (i+2) \bmod 3 + 1 & \text{if } \sigma = 3\pm \\ N & \text{if } \sigma = N\pm \end{cases} \qquad \text{(IX)}$$

Equation X is used to apply Bayes' rule to determine the probability that the sequence S is
in state $\sigma$:

$$P(\sigma \mid S) = \frac{P_\sigma \cdot P_\sigma(S)}{\sum_{i \in \{1+,2+,3+,N+,1-,2-,3-,N-\}} P_i \cdot P_i(S)} \qquad \text{(X)}$$

A bias function is added to equation X in order to allow for biasing of regions of DNA for
which coding information is available. The bias function is incorporated in equation XI:

$$P(\sigma \mid S) = \frac{\phi(\sigma) \cdot P_\sigma \cdot P_\sigma(S)}{\sum_{i \in \{1+,2+,3+,N+,1-,2-,3-,N-\}} \phi(i) \cdot P_i \cdot P_i(S)} \qquad \text{(XI)}$$

Equation XI can be applied to the hypothetical region of DNA shown in the window 308
in Figure 10b. Since the entirety of the sequence in the window 308 lies in a coding region (as
determined with the EST 302), a bias function $\phi(\sigma)$ can be defined according to equation XII:



$$\phi(\sigma) = \begin{cases} 1 & \text{if } \sigma \in \{1+, 2+, 3+\} \\ 0 & \text{otherwise} \end{cases} \qquad \text{(XII)}$$

which reflects that we know with 100% certainty that the sequence segment must be
coding in one of the three direct reading frames, but that we do not know which. In this case,
since $\phi(\sigma) = 0$ where $\sigma \in \{N+, 1-, 2-, 3-, N-\}$, equation XI can be written as equation XIII:



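$$P(\sigma \mid S) = \begin{cases} 0 & \text{if } \sigma \in \{1-, 2-, 3-, N+, N-\} \\ P_\sigma \cdot P_\sigma(S) \cdot \left[ \sum_{i \in \{1+,2+,3+\}} P_i \cdot P_i(S) \right]^{-1} & \text{if } \sigma \in \{1+, 2+, 3+\} \end{cases} \qquad \text{(XIII)}$$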


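Because $P_{1+} = P_{2+} = P_{3+}$ (since the EST does not indicate any difference in probability
among the three reading frames), equation XIII can be simplified as shown in equation XIV:

$$P(\sigma \mid S) = \begin{cases} 0 & \text{if } \sigma \in \{1-, 2-, 3-, N+, N-\} \\ P_\sigma(S) \cdot \left[ \sum_{i \in \{1+,2+,3+\}} P_i(S) \right]^{-1} & \text{if } \sigma \in \{1+, 2+, 3+\} \end{cases} \qquad \text{(XIV)}$$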

The function $\phi(\sigma)$ results in a coding potential (equation XIV) substantially different from
the unbiased coding potential function (shown by equation X). In this example, the chosen bias
function reduces the probability of the evaluated window 308 to zero in all but the three
plus-strand coding states. This effectively forces the window to be evaluated as coding in one of the
positive coding states, while not biasing the probability of those states relative to each other (e.g.,
the ratio of the probabilities of two plus-strand coding states is the same with or without the bias
function, whereas a ratio involving any other state may differ).

Figure 10c illustrates a window 310 wherein the evaluated sequence straddles an exon-intron
boundary as indicated by the EST 302. A possible function $\phi(\sigma)$ for this situation would
be to expand equation XII to equation XIII:

$$\phi(\sigma) = \begin{cases} e & \text{if } \sigma \in \{1+, 2+, 3+\} \\ 1 - e & \text{if } \sigma \in \{N+, N-\} \\ 0 & \text{if } \sigma \in \{1-, 2-, 3-\} \end{cases} \qquad \text{(XIII)}$$

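where $e$ represents the fraction of bases in the window that lie in the coding region of the
DNA 300, as indicated by the coding region 304 of the EST 302. If
equation XIII is put into equation XI, equation XIV results:

$$P(\sigma \mid S) = \begin{cases} 0 & \text{if } \sigma \in \{1-, 2-, 3-\} \\ e \cdot P_\sigma \cdot P_\sigma(S) \cdot \left[ \sum_{i \in \{1+,2+,3+,N+,N-\}} \phi(i) \cdot P_i \cdot P_i(S) \right]^{-1} & \text{if } \sigma \in \{1+, 2+, 3+\} \\ (1 - e) \cdot P_\sigma \cdot P_\sigma(S) \cdot \left[ \sum_{i \in \{1+,2+,3+,N+,N-\}} \phi(i) \cdot P_i \cdot P_i(S) \right]^{-1} & \text{if } \sigma \in \{N+, N-\} \end{cases} \qquad \text{(XIV)}$$

where $P_\sigma = \tfrac{1}{4}$ for $\sigma \in \{N+, N-\}$ and $\tfrac{1}{6}$ for $\sigma \in \{1+, 2+, 3+\}$ (given the assumption that
coding and noncoding are equiprobable events, each coding state is equiprobable with any other
coding state, and that both noncoding states are equiprobable, $\tfrac{1}{4} \times 2 = \tfrac{1}{2}$ and $\tfrac{1}{6} \times 3 = \tfrac{1}{2}$).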
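The two bias functions of this example can be sketched in a few lines of Python. This is only an illustration of equation XI under stated assumptions: the function and variable names are hypothetical, and the priors and likelihoods in the usage lines are made-up numbers, not values from the specification.

```python
STATES = ["1+", "2+", "3+", "N+", "1-", "2-", "3-", "N-"]
PLUS_CODING = ("1+", "2+", "3+")

def biased_posterior(priors, likelihoods, phi):
    """Equation XI: phi(s) * P_s * P_s(S), normalized over all states."""
    weights = {s: phi(s) * priors[s] * likelihoods[s] for s in STATES}
    total = sum(weights.values())
    if total == 0:
        raise ValueError("bias excludes every state with nonzero likelihood")
    return {s: w / total for s, w in weights.items()}

def phi_coding_window(state):
    """Equation XII: the whole window lies in an EST-supported coding region."""
    return 1.0 if state in PLUS_CODING else 0.0

def make_phi_straddling(e):
    """Bias for a window straddling an exon-intron boundary (Figure 10c).

    e is the fraction of bases in the window that lie in the coding region.
    """
    def phi(state):
        if state in PLUS_CODING:
            return e
        if state in ("N+", "N-"):
            return 1.0 - e
        return 0.0
    return phi

# Hypothetical inputs, chosen only to exercise the functions.
priors = {s: 1.0 / len(STATES) for s in STATES}
likelihoods = {"1+": 4e-6, "2+": 1e-6, "3+": 1e-6, "N+": 2e-6,
               "1-": 6e-6, "2-": 1e-6, "3-": 1e-6, "N-": 2e-6}

in_exon = biased_posterior(priors, likelihoods, phi_coding_window)
straddling = biased_posterior(priors, likelihoods, make_phi_straddling(0.5))
```

With the first bias, every state outside the plus-strand reading frames receives probability zero, while the ratio between any two plus-strand states is unchanged from the unbiased case.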

EXAMPLE 2

The following example illustrates the computations involved in probability calculations
for a sequence with and without a bias applied. The nucleotide sequence GATGACATT is used
in this example for clarity and simplicity, but it is understood that longer sequences as indicated
above can be used. Further, for this example, a zero order inhomogeneous Markov model is
used. In this model, the initial probabilities are all 1 and each event is independent of that which
precedes it ($a_1 \ldots a_k \rightarrow a_{k+1}$ becomes $N \rightarrow a_1$ because $k$ is zero). Models of higher order can be used,
as described above.

Accordingly, the following hypothetical table of probabilities is used: 

              Direct (+)              Reverse (-)
         1+      2+      3+      1-      2-      3-      N±
T       0.13    0.27    0.13    0.10    0.25    0.21    0.20
C       0.28    0.26    0.39    0.39    0.21    0.38    0.30
A       0.21    0.26    0.09    0.13    0.27    0.13    0.21
G       0.38    0.21    0.39    0.38    0.26    0.28    0.29


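Without a bias function $\phi(\sigma)$ to incorporate known information in the calculations, $P(S \mid \sigma)$
can be calculated for the zero order case for the sequence GATGACATT according to equations
XV through XXI.

$$\begin{aligned} P(\text{GATGACATT} \mid 1+) &= P(N) \cdot P_{1+}(G \mid N) \cdot P_{2+}(A \mid N) \cdot P_{3+}(T \mid N) \cdot P_{1+}(G \mid N) \cdot P_{2+}(A \mid N) \cdot P_{3+}(C \mid N) \cdot P_{1+}(A \mid N) \cdot P_{2+}(T \mid N) \cdot P_{3+}(T \mid N) \\ &= P_{1+}(G) \cdot P_{2+}(A) \cdot P_{3+}(T) \cdot P_{1+}(G) \cdot P_{2+}(A) \cdot P_{3+}(C) \cdot P_{1+}(A) \cdot P_{2+}(T) \cdot P_{3+}(T) \\ &= 0.38 \times 0.26 \times 0.13 \times 0.38 \times 0.26 \times 0.39 \times 0.21 \times 0.27 \times 0.13 \\ &= 3.6479448 \times 10^{-6} \end{aligned} \qquad \text{(XV)}$$

$$\begin{aligned} P(\text{GATGACATT} \mid 2+) &= P_{2+}(G) \cdot P_{3+}(A) \cdot P_{1+}(T) \cdot P_{2+}(G) \cdot P_{3+}(A) \cdot P_{1+}(C) \cdot P_{2+}(A) \cdot P_{3+}(T) \cdot P_{1+}(T) \\ &= 0.21 \times 0.09 \times 0.13 \times 0.21 \times 0.09 \times 0.28 \times 0.26 \times 0.13 \times 0.13 \\ &= 5.7132739 \times 10^{-8} \end{aligned} \qquad \text{(XVI)}$$

$$\begin{aligned} P(\text{GATGACATT} \mid 3+) &= P_{3+}(G) \cdot P_{1+}(A) \cdot P_{2+}(T) \cdot P_{3+}(G) \cdot P_{1+}(A) \cdot P_{2+}(C) \cdot P_{3+}(A) \cdot P_{1+}(T) \cdot P_{2+}(T) \\ &= 0.39 \times 0.21 \times 0.27 \times 0.39 \times 0.21 \times 0.26 \times 0.09 \times 0.13 \times 0.27 \\ &= 1.4874917 \times 10^{-6} \end{aligned} \qquad \text{(XVII)}$$

$$\begin{aligned} P(\text{GATGACATT} \mid 1-) &= P_{1-}(G) \cdot P_{2-}(A) \cdot P_{3-}(T) \cdot P_{1-}(G) \cdot P_{2-}(A) \cdot P_{3-}(C) \cdot P_{1-}(A) \cdot P_{2-}(T) \cdot P_{3-}(T) \\ &= 0.38 \times 0.27 \times 0.21 \times 0.38 \times 0.27 \times 0.38 \times 0.13 \times 0.25 \times 0.21 \\ &= 5.7332419 \times 10^{-6} \end{aligned} \qquad \text{(XVIII)}$$

$$\begin{aligned} P(\text{GATGACATT} \mid 2-) &= P_{2-}(G) \cdot P_{3-}(A) \cdot P_{1-}(T) \cdot P_{2-}(G) \cdot P_{3-}(A) \cdot P_{1-}(C) \cdot P_{2-}(A) \cdot P_{3-}(T) \cdot P_{1-}(T) \\ &= 0.26 \times 0.13 \times 0.10 \times 0.26 \times 0.13 \times 0.39 \times 0.27 \times 0.21 \times 0.10 \\ &= 2.5262776 \times 10^{-7} \end{aligned} \qquad \text{(XIX)}$$

$$\begin{aligned} P(\text{GATGACATT} \mid 3-) &= P_{3-}(G) \cdot P_{1-}(A) \cdot P_{2-}(T) \cdot P_{3-}(G) \cdot P_{1-}(A) \cdot P_{2-}(C) \cdot P_{3-}(A) \cdot P_{1-}(T) \cdot P_{2-}(T) \\ &= 0.28 \times 0.13 \times 0.25 \times 0.28 \times 0.13 \times 0.21 \times 0.13 \times 0.10 \times 0.25 \\ &= 2.2607130 \times 10^{-7} \end{aligned} \qquad \text{(XX)}$$

$$\begin{aligned} P(\text{GATGACATT} \mid N) &= P_N(G) \cdot P_N(A) \cdot P_N(T) \cdot P_N(G) \cdot P_N(A) \cdot P_N(C) \cdot P_N(A) \cdot P_N(T) \cdot P_N(T) \\ &= 0.29 \times 0.21 \times 0.20 \times 0.29 \times 0.21 \times 0.30 \times 0.21 \times 0.20 \times 0.20 \\ &= 1.8692402 \times 10^{-6} \end{aligned} \qquad \text{(XXI)}$$
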
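Given the values of $P(S \mid \sigma)$, we can determine the probability that the given sequence
segment is in state $\sigma$, $P(\sigma \mid S)$, using equation XXII (Bayes' rule):

$$P(\sigma \mid S) = \frac{P(\sigma) \cdot P(S \mid \sigma)}{\sum_i \left[ P(i) \cdot P(S \mid i) \right]} \qquad \text{(XXII)}$$
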

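Equations XXIII through XXIX show the calculations for each of the states.

$$P(1+ \mid S) = \frac{\tfrac{1}{12}(3.6479448 \times 10^{-6})}{\tfrac{1}{12}(3.6479448 \times 10^{-6}) + \cdots + \tfrac{1}{12}(1.8692402 \times 10^{-6})} = \frac{3.0399540 \times 10^{-7}}{1.1060761 \times 10^{-6}} = 0.27484131 \qquad \text{(XXIII)}$$

$$P(2+ \mid S) = \frac{4.7611061 \times 10^{-9}}{1.1060761 \times 10^{-6}} = 0.004304501 \qquad \text{(XXIV)}$$

$$P(3+ \mid S) = \frac{1.2395764 \times 10^{-7}}{1.1060761 \times 10^{-6}} = 0.11156173 \qquad \text{(XXV)}$$

$$P(1- \mid S) = \frac{4.7777016 \times 10^{-7}}{1.1060761 \times 10^{-6}} = 0.43195053 \qquad \text{(XXVI)}$$

$$P(2- \mid S) = \frac{2.1052313 \times 10^{-8}}{1.1060761 \times 10^{-6}} = 0.019033331 \qquad \text{(XXVII)}$$

$$P(3- \mid S) = \frac{1.8839275 \times 10^{-8}}{1.1060761 \times 10^{-6}} = 0.017032531 \qquad \text{(XXVIII)}$$

$$P(N \mid S) = \frac{1.5577002 \times 10^{-7}}{1.1060761 \times 10^{-6}} = 0.14070807 \qquad \text{(XXIX)}$$
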

The coding probability function indicates a 43% probability that the sequence is coding in
the first reading frame of the reverse-complement strand (-) of the sequence provided, based on
the zero order inhomogeneous Markov model used. Although this is the most probable state,
there is a greater probability (57%) that the sequence is not in that state.
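The unbiased calculation can be reproduced with a short Python sketch. It assumes equal priors for the seven states, which cancel on normalization, and it numbers sequence positions from zero so that the first nucleotide is read in the phase named by the state. The function names are illustrative only.

```python
# Hypothetical zero-order table from this example: TABLE[state][nucleotide].
TABLE = {
    "1+": {"T": 0.13, "C": 0.28, "A": 0.21, "G": 0.38},
    "2+": {"T": 0.27, "C": 0.26, "A": 0.26, "G": 0.21},
    "3+": {"T": 0.13, "C": 0.39, "A": 0.09, "G": 0.39},
    "1-": {"T": 0.10, "C": 0.39, "A": 0.13, "G": 0.38},
    "2-": {"T": 0.25, "C": 0.21, "A": 0.27, "G": 0.26},
    "3-": {"T": 0.21, "C": 0.38, "A": 0.13, "G": 0.28},
    "N":  {"T": 0.20, "C": 0.30, "A": 0.21, "G": 0.29},
}
STATES = list(TABLE)

def likelihood(seq, state):
    """P(S | state) for the zero-order model: a product of per-base terms."""
    p = 1.0
    for i, base in enumerate(seq):
        if state == "N":
            column = "N"
        else:
            start, strand = int(state[0]), state[1]
            phase = (start - 1 + i) % 3 + 1   # cycles 1, 2, 3 from the start phase
            column = f"{phase}{strand}"
        p *= TABLE[column][base]
    return p

def posterior(seq):
    """Bayes' rule with equal priors (an assumption of this sketch)."""
    lik = {s: likelihood(seq, s) for s in STATES}
    total = sum(lik.values())
    return {s: v / total for s, v in lik.items()}

post = posterior("GATGACATT")
best = max(post, key=post.get)   # most probable state
```

Run on GATGACATT, the sketch ranks the first reverse-strand frame highest, at roughly 0.43, in line with the result stated above.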

An investigator can apply the bias function method to impose a bias based on
prior knowledge of sequence features, such as an EST alignment to the subject sequence, or
homology to a previously characterized sequence. For example, given an EST alignment to the
subject sequence that implies the sequence is coding on the positive strand, a bias function can be
defined that summarizes that observation. Equation XXX is one example of such a function:

$$\phi(\sigma) = \begin{cases} 0.95 & \text{if } \sigma \in \{1+, 2+, 3+\} \\ 0.05 & \text{if } \sigma \notin \{1+, 2+, 3+\} \end{cases} \qquad \text{(XXX)}$$


This bias function does not exclude the possibility that the sequence is noncoding or
coding on the reverse complement strand, although it does effectively bias the a priori
probability that the sequence is coding in one of the forward three reading frames. The function
above states that the three forward coding states are 19-fold (0.95/0.05) more probable than the
other states, which is an assertion by the investigator of confidence that the EST alignment
is correct in indicating that the sequence is coding on that strand.

Given the bias function defined above, the values for $P(S \mid \sigma)$ are determined as before for
the unbiased case. To calculate $P'(\sigma \mid S)$, however, equation XXXI is used:

$$P'(\sigma \mid S) = \frac{\phi(\sigma) \cdot P(\sigma) \cdot P(S \mid \sigma)}{\sum_i \left[ \phi(i) \cdot P(i) \cdot P(S \mid i) \right]} \qquad \text{(XXXI)}$$


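The equations to determine $P'(\sigma \mid S)$ for each state are shown in equations XXXII through
XXXVIII:

$$P'(1+ \mid S) = \frac{\phi(1+) \cdot P(1+) \cdot P(S \mid 1+)}{\sum_i \left[ \phi(i) \cdot P(i) \cdot P(S \mid i) \right]} = \frac{0.95 \cdot \tfrac{1}{12}(3.6479448 \times 10^{-6})}{0.95 \cdot \tfrac{1}{12}(3.6479448 \times 10^{-6}) + \cdots + 0.05 \cdot \tfrac{1}{12}(1.8692402 \times 10^{-6})} = \frac{2.8879563 \times 10^{-7}}{4.4399294 \times 10^{-7}} = 0.65045095 \qquad \text{(XXXII)}$$

$$P'(2+ \mid S) = \frac{0.95 \cdot P(2+) \cdot P(S \mid 2+)}{4.4399294 \times 10^{-7}} = 0.010187213 \qquad \text{(XXXIII)}$$

$$P'(3+ \mid S) = \frac{0.95 \cdot P(3+) \cdot P(S \mid 3+)}{4.4399294 \times 10^{-7}} = 0.2652289 \qquad \text{(XXXIV)}$$

$$P'(1- \mid S) = \frac{0.05 \cdot P(1-) \cdot P(S \mid 1-)}{4.4399294 \times 10^{-7}} = 0.05380379 \qquad \text{(XXXV)}$$

$$P'(2- \mid S) = \frac{0.05 \cdot P(2-) \cdot P(S \mid 2-)}{4.4399294 \times 10^{-7}} = 0.0023707938 \qquad \text{(XXXVI)}$$

$$P'(3- \mid S) = 0.00042392676 \qquad \text{(XXXVII)}$$

$$P'(N \mid S) = \frac{0.05 \cdot P(N) \cdot P(S \mid N)}{4.4399294 \times 10^{-7}} = 0.0017534085 \qquad \text{(XXXVIII)}$$
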

Given the bias function $\phi(\sigma)$, the resulting coding potential calculation indicates a 65%
probability that the sequence is coding in the first reading frame on the forward strand. The
result represents the coding probability given the assumptions of the investigator stated as the
bias function.
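The biased calculation can be sketched in Python as follows. The likelihoods are the products of the factors listed in equations XV through XXI, rounded to three significant figures, and equal priors are assumed (they cancel on normalization). All names are illustrative.

```python
# P(S | state) for GATGACATT, rounded products of the factors in
# equations XV through XXI.
LIKELIHOOD = {
    "1+": 3.65e-6, "2+": 5.71e-8, "3+": 1.49e-6,
    "1-": 5.73e-6, "2-": 2.53e-7, "3-": 2.26e-7,
    "N": 1.87e-6,
}
FORWARD_CODING = ("1+", "2+", "3+")

def phi(state):
    """Bias of equation XXX: favor the three forward coding states 19-fold."""
    return 0.95 if state in FORWARD_CODING else 0.05

def biased_posterior(likelihood, bias):
    """Equation XXXI with equal priors (an assumption of this sketch)."""
    weights = {s: bias(s) * p for s, p in likelihood.items()}
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}

biased = biased_posterior(LIKELIHOOD, phi)
unbiased = biased_posterior(LIKELIHOOD, lambda state: 1.0)
```

The bias moves the most probable state from the first reverse frame in the unbiased case to the first forward frame, at roughly 0.65.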

EXAMPLE 3 

The following is a copy of the output of a program implementing the method
described above with and without a bias function. The following sequence is a genomic sample
from the organism Arabidopsis thaliana, Landsberg.



TACTCAAAAATATATTCCATGCTTAATTAGGCCGGATTCGCGGTGACGATGCACCAAGAGCGGTTTTTCCGA 
GCATTGTAGGCCGTCCTCGCCACACCGGTGTGATGGTTGGGATGGGACAAAAGGATGCTTATGTTGGAGACGAGGCTC 
AATCAAAACGTGGTATCTTGACTCTGAAGTACCCAATTGAGCATGGAATTGTTAATAATTGGGATGACATGGAGAAGA 
TTTGGCATCACACTTTCTACAATGAGCTTCGTGTTGCCCCTGAAGAACATCCGGTTCTCTTGACCGAAGCTCCTCTCA 
ATCCGAAAGCTAACCGTGAGAAGATGACTCAGATCATGTTTGAGACATTCAATACTCCTGCTATGTATGTTGCCATTC 
AAGCTGTTCTCTCACTCTATGCCAGTGGCCGTACTACTGGTCAGTACATTACTACATTCTTTTTATACCGTTTGGTTG 
AAATAAAATTCGGTTTGGTTCGATTCGAGTTTGCTCTCATTATTTTTATTTTGTTGGTTAGGTATTGTTTTGGACTCC 
GGAGATGGTGTGAGCCACACGGTACCAATCTACGAGGGTTATGCACTTCCACACGCAATCCTGCGTCTTGATCTTGCA 
GGTCGTGACCTAACCGACCACCTTATGAAAATCCTGACAGAGCGTGGTTACTCTTTCACCACAACTGCTGAGCGTGAG 
ATTGTTAGAGACATGAAGGAGAAGCTCTCTTACATTGCCTTGGACTTTGAACAAGAGCTCGAGACTTCCAAAACAAGC
TCATCCGTTGAGAAGAGCTTCGAGCTGCCAGACGGTCAAGTGATCACCATCGGGGCAGAGCGTTTCCGATGCCCTGAA 
GTTCTGTTTCAGCCATCGATGATCGGAATGGAAAATCCGGGAATTCATGAAACTACTTACAACTCAATCATGAAATGT 
GATGTGGATATCAGGAAGGATCTTTATGGAAACATTGTGCTTAGTGGTGGCACCACAATGTTCGATGGGATTGGTGAT 
AGGATGAGTAAAGAGATCACAGCGTTGGCTCCAAGCAGTATGAAGATCAAAGTGGTGGCTCCACCGGAAAGGAAGTAC 
AGTGTCTGGATCGGTGGCTCTATCTTGGCTTCCCTCAGTACTTTCCAGCAGGTAAATTACTTACTATACTTAATACAT 
AAAGTCTATTAGTGATTTGATGTATAAAGTGTTACAAAAATGTGTTCCAAATTTGCAGATGTGGATTGCGAAAGCGGA 
GTATGATGAATCTGGACCGTCAATCGTCCACAGGAAGTGCTTCTGATCAAAAGTCACCAAGTAAAACAAGAGCGGTAA 
AAATTTTGATATCAGTTTTTCACCCTGAAGCCAGTTGCTATAATTACTCACAACTTCTCTATTTGTGTTCTTTTATTC 
TTGTCCCTCGTTGTTCATTTTAATCTCTTTTTTGCAACAAAGCAACTTAAAAAAACAGAGCAGTCATTAACAGAATGT 
TATTATTATATATATGTATACATATTAGTATACACCCATTATTTCATTAAAACATTTATCATATAAGGATAGGATTCT 
ATACATCGATATATTTATTTTGTTGACACTATTCAGCACATGCTTATGTCTTATCTTGTTAGTATATGTAACCAAAGA 
CAAATAATAGATGCTACAAATTGTTTTCTTTGAAGCAAAAATTTCAATCTTAAAATTGTTTTTTTCCAGGTTACACAA 
AAAAAACTTGTAGTTTGTAAATTTTCTATACAATTTTGGGGATCTCAACAAGAACATGAACTTCAACTTCTAGTCATA 
TGACGACCTGAGTCTGCGCGGCTGTGAATCTCTTTGCTGCAGTAAATGTTTACAAGTGGTGTGTAAATTGGTACTGAT 
TCAAAAGCTTTAAGAAATCTACACATTTCGTGAAATTATTTAGCAGACTTGATATTAAAAATCTAGGATAAAATGACT 
ATCCAAAGACAAATAGGACTGTTTCACATGTTCCCCTGATTCTTGTAGCTCATAACTCATCAGCAGTTAACTTTTCTA 
CCTCATACACGCTCGCAATNCGTTTGGAATTATCAGCTNTAATTTTTCTAATTCTTTGGAAATTATTAGCAGCTCGAT 
CAAATGGGGCATGGCTTCTTCTTCTATCTGCAACTCATCTAAACTTTCCATGAAGAAACAAAGCT (SEQ. ID. NO. 1)

The sequence below is the same Arabidopsis sequence after coding probabilities have 
been determined without a bias, the coding strand has been determined, and each nucleotide has 
been classified in its most probable state of the four on the coding strand (dashes represent the 
state of noncoding). 



1: 1 

61: 111111111111313333333333333333133333333333333333333333333333

121: 333323333333333333333333333313333333333333333333333333333333

181: 333333333333333333333333333333333313333313333333333333133333

241: 333333333133133333333133333333333333333333333333333333333133

301: 333333333333333133333333333333333313333333333333333333333333

361: 333-33333-333333-3333333333333333-33333333333333333333333333 

421: 333333333333—3 — 3—333333333-33 

481: 11—11-1- 

541: -1111111111111111111111111111111111111111111111111111111-111

601: 1111111111111111111111111111111111111111111111111111111-1111

661: 1111111111-111-11111111111111-111111111111111111111111111111

721: 1111111111111-11111-11111111-1111111111111111111111111111111

781: 111111111111111111111111111111111111111111111111111111111111

841: 1111111111111111111111111111111111111111111111111111111-1111

901: 11111111111111111-1111111111111111111111111111-1111111111111

961: 111111111111111111111111111111111111111111111111111111111111

1021: 111111111111111111111111111111111111111111111111111111111111
1081: 1111111111111131111111111111131 " 

1201- 222-2222222222-22-222-222222-3333333333333333333333333 

1261 : 3333333333333333—33-3—3-3 33-33333333-333 — — 

1321: - — 333 — 3 ~~ 

1381: 




1441 

1501 
1561 
1621 
1681 
1741 
1801 
1861 
1921 
1981 
2041 
2101 
2161 



-1—1- 



3—33-3 333-3 3 

3-3133-33-33-3 13-22222-222222-2222222222222-2 2 

—22 2222-1222222222222222221222222222222222222222222 

22222 

The classifications are now filtered. First, simple gaps are filled (XYX are reclassified as 



XXX): 




1: 
61: 
121: 
181: 
241: 
301: 
361: 
421: 
481: 
541: 
601: 
661: 
721: 
781: 
841: 
90.1: 
961: 
1021: 
1081: 
1141: 
1201: 
1261: 
1321: 
1381: 
1441: 
1501: 
1561: 
1621: 
1681: 
1741: 
1801: 
1861: 
1921: 
1981: 
2041: 
2101: 



111111111111313333333333333333133333333333333333333333333333. 

333323333333333333333333333313333333333333333333333333333333 

333333333333333333333333333333333313333313333333333333133333 

333333333133133333333133333333333333333333333333333333333133 

333333333333333133333333333333333313333333333333333333333333 

333333333333333333333333333333333333333333333333333333333333 

333333333333—3—3 333333333333 

11—1111- 

-llllllllllllllllllllllllllllllllllllHl 11111111111111111111 
lllllllllllllllllllllllllllllllllllllllllim 111111111111111 
1111111111111111111111111111111111I11111111 11111111111111111 
lllllllllllllllllllllllllllllllllH 1111111111111111111111111 
llllllllllllllllllllllllllllllllllllHl 111111111111111111111 
lllllllllllllllllllllllllllllllllllHl 1111111111111111111111 
lllllllllllllllllllllllllllllllllllll 11111111111111111111111 
lllllllllllllllllllllllllllllllllllllllllim 111111111111111 
lilllllllllllllllllllllllllllllllllH 11111111111111111111111 
111111111111113111111111111H31 

2222222222222222222222222222-3333333333333333333333333 

3333333333333333-3333—3—3 333333333333333 — 

—333—3 



3—3333 33333 3 

33313333333333 13-2222222222222222222222222222 2 

—22 2222-1222222222222222221222222222222222222222222 
2161: 22222

Next, XXYXX gaps are reclassified as XXXXX: 

1 

3 61- ImiU11111313333333333333333333333333333333333333333333333 

121- 333333333333333333333333333333333333333333333333333333333333 
18li 333333333333333333333333333333333333333333333333333333333333 
241- 333333333333333333333333333333333333333333333333333333333333 
10 301: 333333333333333333333333333333333333333333333333333333333333 

361- 333333333333333333333333333333333333333333333333333333333333 

421: 333333333333 333333333333- 

481 • ~~ 

541- -llllllllllllllllllllllllllUlllllllll.U 11111111111111111111 

15 601 ": llllllllllllllllllllllllllllllllllllll 111111111111 """"" 

661- llllllllllllllllllllUUlllllllllllll 11111111111111111111111 

721* lllllllllllllllllllllllllUllllllllll 11111111111111111111111 

781- 1111111111111111U1111111111111111 11111111111111111111111111 

841- llllllllllllllllllUllUllllllllll 11111111111111111111111111 

20 901 : lllllllllllllllllllllllllllllllllllllH 111111111111111111111 

961- llllllllllllllllllllllllllllllllHl 1111111111111111111111111 

102li llllllllllllllllllllllllllllllllHl 1111111111111111111111 ^ 1 ^ 

1081: llllllllllllllllllllllllim 131 lllllllll 

1141* 

25 120 1: 2222222222222222222222222222-3333333333333333333333333 

1261: 3333333333333333—3333 333333333333333------ 

1321: 333 

1381: " 

1441: _ ' ~ 

30 1501: ~ ~ 

1561: ~~~ "_ l__l-lll 

1621: ' 

1681: ■ " ~ Jl-l 

1741: 

35 1801: — . ~ 

1861: "~~ ~ 

1921: ~~ _ " 

1981 . 3333 33333 J 

2041* 33333333333333- 13-2222222222222222222222222222 

40 210li -22 2222-1222222222222222222222222222222222222222222 

2161: 22222 

Next, XXYYXX gaps are reclassified as XXXXXX: 

. 1 

45 61- 111111111111313333333333333333333333333333333333333333333333 

121- 333333333333333333333333333333333333333333333333333333333333 

181- 333333333333333333333333333333333333333333333333333333333333 

24l": 333333333333333333333333333333333333333333333333333333333333 

50 301 : 33333333333333333333333333333333333333333333333333333^ 

361- 333333333333333333333333333333333333333333333333333333333333 

421: 333333333333 333333333333--— """" nlll 

481: 

541 
601 
661 
721 
781 
841 
901 
961 
1021 
1081 
1141 
1201 
1261 
1321 
1381 
1441 
1501 
1561 
1621 
1681 
1741 
1801 
1861 
1921 
1981 
2041 
2101 
2161 



lllllllllllllllllllllllllllllllllll 1111111111111111111111111 
llllllllllllllllllllUlllllllllllllllll 111111111111111111111 
11111111111 llllllllllllllllllllllllllllllllHl 11111111111111 
lllllllllllllllllllllllllllllllHl 11111111111111111111111111 
llllllllllllllllllllllllllllllllllH 111111111111111111111111 
1111111111111111111111111 11111 111111111111111111111 111111111 

llllllllllllllllllllllllllllllH 1111111111111111111111111111 
lllllllllllllllllllllllllllllllH 111111111111111111111111111 
lllllllllllllllllllllllllllllllHl 11111111111111111111111111 
1111111111111111111111H1111131 

2222222222222222222222222222-3333333333333333333333333. 

3333333333333333333333 333333333333333 — - 

— 333 



3333 33333 3 

33333333333333 13-2222222222222222222222222222 

222222222222222222222222222222222222222222222222 

22222 




Next, XYYX gaps are reclassified as XXXX: 




1: 
61: 
121: 
181: 
241: 
301: 
361: 
421: 
481: 
541: 
601: 
661: 
721: 
781: 
841: 
901: 
961: 
1021: 
1081: 
1141: 
1201: 
1261: 



111111111111313333333333333333333333333333333333333333333333 
333333333333333333333333333333333333333333333333333333333333 
333333333333333333333333333333333333333333333333333333333333 
333333333333333333333333333333333333333333333333333333333333 
333333333333333333333333333333333333333333333333333333333333 
333333333333333333333333333333333333333333333333333333333333 

333333333333 

11111 



333333333333- 



llllllllllllllllllllllllllllllllllllllllll 111111111111111111 
llllllllllllllllllllllllllllllllllllllllllll 1111111111111111 

liiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiin 111111111111111111 
lllllllllllllllllllllllllllllllllllllllllll 11111111111111111 
lllllllllllllllllllllllllllllllllllllllll 1111111111111111111 
llllllllllllllllllllllll'llllllllllllllllllll 1111111111111111 
liiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiuimi 11111111 
liiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiinnii 11111111111111111 
liiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiniiiini 111111111 

111111111111111111111111111H31 

2222222222222222222222222222-3333333333333333333333333 

3333333333333333333333 333333333333333 




1321 
1381 
1441 
1501 
1561 
1621 
1681 
1741 
1801 
1861 
1921 
1981 
2041 
2101 
2161 



— 333- 



3333 33333 3 

33333333333333 13-2222222222222222222222222222 

222222222222222222222222222222222222222222222222 

22222 



Next, XYX gaps are reclassified as XXX: 




1: 
61: 
121: 
181: 
241: 
301: 
361: 
421: 
481: 
541: 
601: 
661: 
721: 
781: 
841: 
901: 
961: 
1021: 
1081: 
1141: 
1201: 
1261: 
1321: 
1381: 
1441: 
1501: 
1561: 
1621: 
1681: 
1741: 
1801: 
1861: 
1921: 
1981: 
2041: 



111111111111113333333333333333333333333333333333333333333333 
333333333333333333333333333333333333333333333333333333333333 
333333333333333333333333333333333333333333333333333333333333 
333333333333333333333333333333333333333333333333333333333333 
333333333333333333333333333333333333333333333333333333333333 
333333333333333333333333333333333333333333333333333333333333 
333333333333 333333333333 



lllllllllllllllllllllllllllllllllllllllll 1111111 
111111111111111111111111111111111111111111111111 
llllllllllllllllllllllllllllllllllllllllH 111111 
lllllllllllllllllllllllllllllllllllllllllll 11111 
llllllllllllllllllllllllllllllllllllllllll 111111 
llllllllllllllllllllllllllllllllllllllll 11111111 
lllllllllllllllllllllllllllllllllllllllUll 11111 
llllllllllllllllllllllllllllllllllllllllUlll 111 
lllllllllllllllllllllllllllllllllllllllllHl 1111 
1111111111111111111111111111111 



11111 

111111111111 
111111111111 
111111111111 
111111111111 
111111111111 
111111111111 
111111111111 
111111111111 
111111111111 



2222222222222222222222222222-3333333333333333333333333 

3333333333333333333333 333333333333333 

— 333 



3333 33333 

33333333333333 13-2222222222222222222222222222- 
2101: 222222222222222222222222222222222222222222222222

2161: 22222 
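The gap-filling passes described above can be sketched in Python. This is a minimal illustration under assumptions not stated in the text: X is taken to be a coding state (1, 2, or 3), each Y is any symbol other than X, the string is scanned left to right, and the passes run in the order in which they are described. The function names are hypothetical.

```python
CODING = {"1", "2", "3"}

def fill_gaps(states, flank, gap):
    """Reclassify a run of `gap` non-X symbols flanked by `flank` X's on each side."""
    s = list(states)
    n = len(s)
    i = flank
    while i + gap + flank <= n:
        x = s[i - 1]
        if (x in CODING
                and all(c == x for c in s[i - flank:i])
                and all(c != x for c in s[i:i + gap])
                and all(c == x for c in s[i + gap:i + gap + flank])):
            s[i:i + gap] = [x] * gap      # fill the gap with the flanking state
            i += gap
        else:
            i += 1
    return "".join(s)

# (flank, gap) pairs for XYX, XXYXX, XXYYXX, XYYX, and XYX again.
PASSES = [(1, 1), (2, 1), (2, 2), (1, 2), (1, 1)]

def filter_classification(states):
    """Apply the gap-filling passes in the order described in the text."""
    for flank, gap in PASSES:
        states = fill_gaps(states, flank, gap)
    return states
```

For instance, `filter_classification("33-3--33")` fills the single-symbol gap in the first pass and the two-symbol gap in the XXYYXX pass.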

Next, regions between coding regions that are not introns are reclassified according to the
adjacent sequences:

1: 1 

61: 111111111111113333333333333333333333333333333333333333333333 
121: 333333333333333333333333333333333333333333333333333333333333 

10 181: 333333333333333333333333333333333333333333333333333333333333 
241: 333333333333333333333333333333333333333333333333333333333333 
301: 333333333333333333333333333333333333333333333333333333333333 
361: 333333333333333333333333333333333333333333333333333333333333 
421: 333333333333333333333333333333333 

15 481: 11111 

541: lllllllllllllllllllllllllllllllllllllllllllllllllinilllllll 
601: llllllllllllllllllllllllllllllllllllllllllllllllllllllimil 
661: lllllllllllllllllllllllllllllllllllllllllllllllllllllllllin 
721: 111111111111111111111111111111111111111111111111111111111111 

20 781: lllllllllllllllllllllllllllllllllllllllllllllllllllllllllin 

841: llllllllllllllllllllllllllllllllllllllllllllllllllllllimil 
901: lllllllllllllllllllllllllllllllllllllllllllllllllllllllllin 
961: lllllllllllllllllllllllllllllllllllllllllllllllllllllllllin 
1021: 111111111111111111111111111111111111111111111111111111111111 

25 1081: lilllllllllllllllllllllllllllH 

1141: 

12 01- 222222222222222222222222222233333333333333333333333333 

1261: 333333333333333333333333333333333333333333333333333333333333 

1321: 333333 

30 1381: " 

1441: 

1501: 

1561: 

1621: 

35 1681: 

1741: 

1801: 

1861: 

1921: 

40 1981- 3333333333333333333333333333333333333333333333 

2041: 333333333333333311132222222222222222222222222222222222222222 
2101: 222222222222222222222222222222222222222222222222222222222222 
2161: 22222 

Next, the sequence is checked for frameshifts and reclassified accordingly:



l 



1 : 

61: llllllllllllllllllllllllllllllllllllllllllllllllllllllllini 

121: lllllllllllllllllllllllllllllllllllllllllllllllllllllllinil 

50 181: lllllllllllllllllllllllllllllllllllllllllllllllllllllllllin 

241: liiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiniiin 33333333333333333 

301: 333333333333333333333333333333333333333333333333333333333333 

361: 333333333333333333333333333333333333333333333333333333333333 
421: 333333333333333333333333333333333-



481: 



11111 

541: llllllllllllllllllllllllllllllllllllllllllllllllllllllllllH 
601: llllllllllllllllllllllllllllllllllllllllllllllllllllllllUll 
661: lllllllllllllllllllllllllllllllllllllllllllllllllllllllllHl 
721: lllllllllllllllllllllllllllllllllllllllllllllllllllllllllin 
781: llllllllllllllllllllllllllllllllllllllllllllllllUll 1 lllim 
841: llllllllllllllllllllllllllllllllllllllllllllllllllllllllllH 
901: lllllllllllllllllllllllllllllllllllllllllllllllllllllllllHl 
961: llllllllllllllllllllllllllllllllllllllllllllllllllllllllllH 
1021: lllllllllllllllllllllllllllllllllllllllllllllllllllllllllin 

1081: 1111111111111111111111111111111 

1141: 

12 oi: 222222222222222222222222222222222222222233333333333333 

1261: 333333333333333333333333333333333333333333333333333333333333 

1321: 333333 

1381: 

1441: 

1501: 

1561: 

1621: 

1681: 

1741: 

1801: 

1861: 

1921: 

1981: 3333333333333333333333333333333333333333333333 

2041: 333333333333333333333333333333333222222222222222222222222222 
2101: 222222222222222222222222222222222222222222222222222222222222 
2161: 22222 

Finally, the sequence is translated according to each class in each coding region, where
'x' indicates a stop codon:



1 : XRFFRALxAVLATPVxWLGWDKRMLMLETRLNQNVVSxLxSTQLSMELLIIGMTWRRFGI 

61 : TLSTMSFVLPLKNIRXLTEAPLNPKANREKMTQIMFETFNTPAMYVAIQAVLSLYASGRT 

121 : TGQYITTFFLYRXSGDGVSHTVPIYEGYALPHAILRLDLAGRDLTDHLMKILTERGYSFT 

181 : TTAEREIVRDMKEKLSYIALDFEQELETSKTSSSVEKSFELPDGQVITIGAERFRCPEVL 

241 : FQPSMIGMENPGIHETTYNSIMKCDVDIRKDLYGNIVLSGGTTMFDGIGDRMSKEITALA 

301 : PSSMKIKVVAPPERKYSVWIGGSIXVPNLQMWIAKAEYXNLDRQSSTGSASDQKSPSKTR 

361 : AVKILXNSSAVNFSTSYTLAIRLELSALIFLISLEIISSSIKWGMASSSICNSSKLSMKK 

421 : QSX (SEQ. ID. NO. 2) 
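The translation step can be illustrated with a short Python sketch that reads an in-frame coding string codon by codon using the standard genetic code. How the program maps each classified region to a reading frame is not shown here. Writing stop codons as 'x' follows the output above, while writing 'X' for any codon containing an ambiguous base is an assumption of this sketch.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(coding_dna):
    """Translate an in-frame coding string; trailing partial codons are dropped."""
    protein = []
    for i in range(0, len(coding_dna) - 2, 3):
        codon = coding_dna[i:i + 3].upper()
        residue = CODON_TABLE.get(codon, "X")   # 'X' for codons with ambiguous bases
        protein.append("x" if residue == "*" else residue)
    return "".join(protein)

# A stretch taken from the first line of SEQ. ID. NO. 1.
peptide = translate("CGGTTTTTCCGAGCATTGTAGGCCGTC")
```

The example stretch translates to RFFRALxAV, which matches the start of SEQ. ID. NO. 2 after its leading X.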

The following sequence is the same Arabidopsis sequence used above, but with an
applied bias. Two bias functions are given by equations XXXIX and XL:

$$\phi_1(\sigma) = \begin{cases} 0.95 & \text{if } \sigma \neq N \\ 0.05 & \text{if } \sigma = N \end{cases} \qquad \text{(XXXIX)}$$

$$\phi_2(\sigma) = \begin{cases} 0.05 & \text{if } \sigma \neq N \\ 0.95 & \text{if } \sigma = N \end{cases} \qquad \text{(XL)}$$

where $\phi_1$ is applied to a range of the DNA to which an EST has been associated, while $\phi_2$
is applied to a range of the DNA to which a gap (or intron) in the EST has been associated.
Specifically, $\phi_1$ is applied to nucleotides 1093 through 1137 and 1219 through 1291, while $\phi_2$ is
applied to nucleotides 1138 through 1218. The probabilities are calculated with the bias, the
coding strand is determined, and each nucleotide is classified as the most likely state. The
resulting sequence is depicted below.

1: 1 

61: 111111111111313333333333333333133333333333333333333333333333 

121: 333323333333333333333333333313333333333333333333333333333333 

15 181: 333333333333333333333333333333333313333313333333333333133333 

241: 333333333133133333333133333333333333333333333333333333333133 

301: 333333333333333133333333333333333313333333333333333333333333

361: 333-33333-333333-3333333333333333-33333333333333333333333333 

421: 333333333333—3—3—333333333-33 

20 481: 11 — 11-1- 

541: -1111111111111111111111111111111111111111111111111111111-111 

601: 1111111111111111111111111111111111111111111111111111111-1111 

661: 1111111111-111-11111111111111-111111111111111111111111111111 

721: 1111111111111-11111-11111111-1111111111111111111111111111111 

25 781: 111111111111111111111111111111111111111111111111111111111111 

841: 1111111111111111111111111111111111111111111111111111111-1111 

901: 11111111111111111-1111111111111111111111111111-1111111111111 

961: 111111111111111111111111111111111111111111111111111111111111 

1021: 111111111111111111111111111111111111111111111111111111111111 

30 1081: 11111111111111311111111111111311111111-1 

1141: 

12 01 : 221221222122222213333333333333333333333333 

1261: 3333333333333333333333333333333-33-33333333-333 

1321: —333—3 



35 1381: 
1441: 
1501: 
1561: 
1621: 

40 1681: 
1801
1861 
1921 
1981 
2041 
2101 
2161 



3—33-3 333-3 3 

3-3133-33-33-3 13-22222-222222-2222222222222-2 2 

__ 22 2222-1222222222222222221222222222222222222222222 

22222 



Filtering steps are then applied as before: XYX to XXX:

!- 1 

61- 111111111111313333333333333333133333333333333333333333333333 
121- 333323333333333333333333333313333333333333333333333333333333 

15 181- 333333333333333333333333333333333313333313333333333333133333 
241- 333333333133133333333133333333333333333333333333333333333133 
301- 333333333333333133333333333333333313333333333333333333333333 
361: 333333333333333333333333333333333333333333333333333333333333 
421 : 333333333333—3—3 333333333333 """"""""" I 

2Q 481* ~~ ^ 

541- -111111111111111111U111111111111111 111111111111111111111111 
601- lllllllllllllllllllllllllllUlllllllH 1111111111111111111111 
661- 111111111111111111111U11111111111111 11111111111111111111111 
721- lllllllllllllllllllllllllllllllllllH 11111111111111111111111 

25 781- 11111111111111111111111111U111111111 11111111111111111111111 

841" lllllllllllllllllllllllUllllllllll 1111111111111111111111111 
901- lllllllllllllllllllllllllllllllllllH 11111111111111111111111 
961- llllllllllllllllllllllllllimillll 1111111111111111111111111 

1021: liiiiimiiiiiuiiiiiiiiiiuiiiiiiiiiiiiiiiiiiiiiiiiiiuiiii 

30 1081: 11111111111111311111111111111311111111H _ I 

^201- —————— 221221222122222213333333333333333333333333 

1261: 33333333333333333333333333333333333333333333333— —---—-— 
1321: —333—3 



35 1381: 
1441: 
1501: 
1561: 
1621: 
40 1681: 
1741: 
1801: 
1861: 
1921: 



45 1981: 3—3333 33333. 3 

2041: 33313333333333- 13-2222222222222222222222222222 2 

2101: __ 22 2222-1222222222222222221222222222222222222222222 

2161: 22222 

XXYXX to XXXXX: 

61: 111111111111313333333333333333333333333333333333333333333333 
121: 333333333333333333333333333333333333333333333333333333333333 
181: 333333333333333333333333333333333333333333333333333333333333 
241: 333333333333333333333333333333333333333333333333333333333333 
301: 333333333333333333333333333333333333333333333333333333333333 
361: 333333333333333333333333333333333333333333333333333333333333 
421: 333333333333 333333333333 
481: 11—1111- 
541: -llllllllllllllllllllllllllllllllllllllUlll 1111111111111111 
601: 111111111111111111111111111U11111111111111 11111111111111111 
661: lllllllllllllllllllllllllllllllllllllH 111111111111111111111 
721: llllllllllllllllllllllllllllllllllllllllH 111111111111111111 
781: lllllllllllllllllllllllllllllllllllllllH 1111111111111111111 
841: llllllllllllllllllllllllllllllllllllUll 11111111111111111111 
901: lllllllllllllllllllllllllllllllllllllUll 1111111111111111111 
961: lllllllllllllllllllllllllllllllllllllllH 1111111111111111111 
1021: llllllllllllllllllllllllllllllllllllllHl 1111111111111111111 
1081: lllllllllllllllllllllllllllllUll 1111111 
1141: 
1201: 222222222222222213333333333333333333333333 
1261: 33333333333333333333333333333333333333333333333 ---- 
1321: 333 
1381: 
1441: 
1501: 
1561: 
1621: 
1681: 
1741: 
1801: 
1861: 
1921: 
1981: 3333 33333 3 
2041: 33333333333333 13-2222222222222222222222222222 
2101: —22 2222-1222222222222222222222222222222222222222222 
2161: 22222 

XXYYXX to XXXXXX: 

61: 111111111111313333333333333333333333333333333333333333333333 
121: 333333333333333333333333333333333333333333333333333333333333 
181: 333333333333333333333333333333333333333333333333333333333333 
241: 333333333333333333333333333333333333333333333333333333333333 
301: 333333333333333333333333333333333333333333333333333333333333 
361: 333333333333333333333333333333333333333333333333333333333333 
421: 333333333333 333333333333 
481: urn 
541: UllllllllllllllllllllllllllllllllllllllllllUll 111111111111 
601: liiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiniiiiinn 1111 
661: lllllllllllllllllllllllllllllllllllllllllllimiini 11111111 
721: lllllllllllllllllllllllllllllllllllllllll 111 ! 11 !! 1 ! 111 ! 11111 
781: liiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiimiiii 11111111 
841: liiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiniiiiinni 11111 
901: lllllllllllllllllllllllllllllllllllllllllllH 111111111111111 
961: 111111111111111111111111111111 111111111111111111111111111111 
1021: 1111111111111111111111111111111111111 11111111111111111111111 
1081: 1111111111111111111111111111111111111111 
1141: 
1201: 222222222222222213333333333333333333333333 
1261: 33333333333333333333333333333333333333333333333 
1321: — 333 
1381: 
1441: 
1501: 
1561: 
1621: 
1681: 
1741: 
1801: 
1861: 
1921: 
1981: 3333 33333 3 
2041: 33333333333333 13-2222222222222222222222222222 
2101: 222222222222222222222222222222222222222222222222 
2161: 22222 



XYYX to XXXX: 



1: 
61: 111111111111313333333333333333333333333333333333333333333333 
121: 333333333333333333333333333333333333333333333333333333333333 
181: 333333333333333333333333333333333333333333333333333333333333 
241: 333333333333333333333333333333333333333333333333333333333333 
301: 333333333333333333333333333333333333333333333333333333333333 
361: 333333333333333333333333333333333333333333333333333333333333 
421: 333333333333 333333333333 
481: 11111 
541: llllllllllllllllllllllllllllllllllllllllUlll 111111111111111 
601: 111111111111111111111111111111111111111H 1111111111111111111 
661: lllllllllllllllllllllllllllllllllllllllH 1111111111111111111 
721: lllllllllllllllllllllllllllllllllllllUll 1111111111111111111 
781: llllllllllllllllllllllllllllllllllllllllH 111111111111111111 
841: lllllllllllllllllllllllllllllilllllllUll 1111111111111111111 
901: lllllllllllllllllllllllllllllllllllllllH 1111111111111111111 
961: 11111111111111111111111111111111111111111H 11111111111111111 
1021: lllllllllllllllllllllllllllllllllllllUlll 111111111111111111 
1081: 1111111111111111111111111111111111111111 
1141: 
1201: 222222222222222213333333333333333333333333 
1261: 33333333333333333333333333333333333333333333333 
1321: 333 
1381: 
1441: 
1501: 
1561: 
1621: 
1681: 
1741: 
1801: 
1861: 
1921: 
1981: 3333 33333 3 
2041: 33333333333333 13-2222222222222222222222222222 
2101: 222222222222222222222222222222222222222222222222 
2161: 22222 



XYX to XXX: 



1: 
61: 111111111111113333333333333333333333333333333333333333333333 
121: 333333333333333333333333333333333333333333333333333333333333 
181: 333333333333333333333333333333333333333333333333333333333333 
241: 333333333333333333333333333333333333333333333333333333333333 
301: 333333333333333333333333333333333333333333333333333333333333 
361: 333333333333333333333333333333333333333333333333333333333333 
421: 333333333333 333333333333 
481: 11111 
541: lllllllllllllllllllllllllllllllllllllll 1 ! 1111111111111111111 
601: lllllllllllllllllllllllllllllllllllllllH 1111111111111111111 
661: lllllllllllllllllllllllllllllllllllllllllllllH 1111111111111 
721: llllllllllllllllllllllllllllllllllllllllllllllll 111111111111 
781: lllllllllllllllllllllllllllllllllllllllllllllH 1111111111111 
841: lllllllllllllllllllllllllllllllllllH 11111111111111111111111 
901: llllllllllllllllllllllllllllllllllllllllllllllll 111111111111 
961: illlllllllllllllllllllllllllllllllllllllimiH 1111111111111 
1021: llllllllllllllllllllllllllllllllHl 1111111111111111111111111 
1081: llllllllllllllllllllllllllllllllH 111111 
1141: 
1201: 222222222222222213333333333333333333333333 
1261: 33333333333333333333333333333333333333333333333 
1321: — 333 
1381: 
1441: 
1501: 
1561: 
1621: 
1681: 
1741: 
1801: 
1861: 
1921: 
1981: 3333 33333 3 
2041: 33333333333333 13-2222222222222222222222222222 
2101: 222222222222222222222222222222222222222222222222 
2161: 22222 



Gaps between coding regions that are not introns are filled as before: 






61: 111111111111113333333333333333333333333333333333333333333333 
121: 333333333333333333333333333333333333333333333333333333333333 
181: 333333333333333333333333333333333333333333333333333333333333 
241: 333333333333333333333333333333333333333333333333333333333333 
301: 333333333333333333333333333333333333333333333333333333333333 
361: 333333333333333333333333333333333333333333333333333333333333 
421: 333333333333333333333333333333333 
481: 11111 
541: 111111111111111111111111111111111111111111111111111111111111 
601: 111111111111111111111111111111111111111111111111111111111111 
661: 111111111111111111111111111111111111111111111111111111111111 
721: 111111111111111111111111111111111111111111111111111111111111 
781: 111111111111111111111111111111111111111111111111111111111111 
841: 111111111111111111111111111111111111111111111111111111111111 
901: 111111111111111111111111111111111111111111111111111111111111 
961: 111111111111111111111111111111111111111111111111111111111111 
1021: 111111111111111111111111111111111111111111111111111111111111 
1081: 1111111111111111111111111111111111111111 
1141: 
1201: 222222222222222213333333333333333333333333 
1261: 333333333333333333333333333333333333333333333333333333333333 
1321: 333333 
1381: 
1441: 
1501: 
1561: 
1621: 
1681: 
1741: 
1801: 
1861: 
1921: 
1981: 3333333333333333333333333333333333333333333333 
2041: 333333333333333311132222222222222222222222222222222222222222 
2101: 222222222222222222222222222222222222222222222222222222222222 
2161: 22222 

Frameshifts are verified and nucleotides are reclassified accordingly: 



1: x 
61: 111111111111111111111111111111111111111111111111111111111111 
121: 111111111111111111111111111111111111111111111111111111111111 
181: 111111111111111111111111111111111111111111111111111111111111 
241: 111111111111111111111111111111111111111111133333333333333333 
301: 333333333333333333333333333333333333333333333333333333333333 
361: 333333333333333333333333333333333333333333333333333333333333 
421: 333333333333333333333333333333333 
481: 11111 
541: 111111111111111111111111111111111111111111111111111111111111 
601: 111111111111111111111111111111111111111111111111111111111111 
661: 111111111111111111111111111111111111111111111111111111111111 
721: 111111111111111111111111111111111111111111111111111111111111 
781: 111111111111111111111111111111111111111111111111111111111111 
841: 111111111111111111111111111111111111111111111111111111111111 
901: 111111111111111111111111111111111111111111111111111111111111 
961: 111111111111111111111111111111111111111111111111111111111111 
1021: 111111111111111111111111111111111111111111111111111111111111 
1081: 1111111111111111111111111111111111111111 
1141: 
1201: 222222222222222222222222222233333333333333 
1261: 333333333333333333333333333333333333333333333333333333333333 
1321: 333333 
1381: 
1441: 
1501: 
1561: 
1621: 
1681: 
1741: 
1801: 
1861: 
1921: 
1981: 3333333333333333333333333333333333333333333333 
2041: 333333333333333333333333333333333222222222222222222222222222 
2101: 222222222222222222222222222222222222222222222222222222222222 
2161: 22222 

And the sequence is translated as before: 

1: XRFFRALXAVLATPVXWLGWDKRMLMLETRLNQNVVSXLXSTQLSMELLIIGMTWRRFGI 

61: TLSTMSFVLPLKNIRXLTEAPLNPKANREKMTQIMFETFNTPAMYVAIQAVLSLYASGRT 

121: TGQYITTFFLYRXSGDGVSHTVPIYEGYALPHAILRLDLAGRDLTDHLMKILTERGYSFT 

181: TTAEREIVRDMKEKLSYIALDFEQELETSKTSSSVEKSFELPDGQVITIGAERFRCPEVL 

241: FQPSMIGMENPGIHETTYNSIMKCDVDIRKDLYGNIVLSGGTTMFDGIGDRMSKEITALA 

301: PSSMKIKVVAPPERKYSVWIGGSILASXQMWIAKAEYXNLDRQSSTGSASDQKSPSKTRA 

361: VKILXNSSAVNFSTSYTLAIRLELSALIFLISLEIISSSIKWGMASSSICNSSKLSMKKQ 

421: SX (SEQ. ID. NO. 3) 
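For orientation, translation of an in-frame coding sequence can be sketched as below. The sketch uses the standard genetic code and assumes, purely for illustration, that stop codons and codons containing unrecognized characters are written as X; this passage does not state what X denotes, and the function name and the handling of trailing nucleotides are likewise hypothetical.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"


def translate(coding_sequence, stop_symbol="X"):
    """Translate an in-frame DNA coding sequence into one-letter amino acids.

    Stop codons and codons with characters outside T, C, A, G are written as
    `stop_symbol` (an assumption of this sketch). Trailing nucleotides that
    do not complete a codon are ignored.
    """
    sequence = coding_sequence.upper()
    protein = []
    for start in range(0, len(sequence) - 2, 3):
        codon = sequence[start:start + 3]
        if all(base in BASES for base in codon):
            index = (16 * BASES.index(codon[0])
                     + 4 * BASES.index(codon[1])
                     + BASES.index(codon[2]))
            residue = AMINO_ACIDS[index]
        else:
            residue = "*"
        protein.append(stop_symbol if residue == "*" else residue)
    return "".join(protein)


print(translate("ATGGCTTAA"))  # prints MAX
```

In practice the input would be the nucleotides classified as coding, joined across exons and read in the frame implied by the state classification.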

The resulting amino acid sequence (SEQ. ID. NO. 3) differs from the amino acid 
sequence calculated without a bias (SEQ. ID. NO. 2). The relative accuracy of the two amino 
acid sequences can be determined by comparison to a known sequence. SEQ. ID. NO. 2 and 
SEQ. ID. NO. 3 are compared to the translation of the actin gene from Arabidopsis thaliana, 
Columbia (SEQ. ID. NO. 4). Dashes indicate gaps in the sequence and asterisks indicate a match 
among all three sequences. The predicted amino acid sequences (SEQ. ID. NOs. 2 and 3) are 
based on an Arabidopsis thaliana, Landsberg ecotype. A comparison of the predicted sequences 
with the known Arabidopsis thaliana, Columbia ecotype amino acid sequence (SEQ. ID. NO. 4) is 
shown below. The sequence set forth in Box A illustrates an area of the biased sequence that 
shows a higher level of identity with the Arabidopsis thaliana, Columbia sequence. 

unbiased -XRFFRALX-AVLATPVXWLGWDKRMLMLETRLNQNVVSX LXSTQLSMELLIIG M 
biased   -XRFFRALX-AVLATPVXWLGWDKRMLMLETRLNQNVVSX LXSTQLSMELLIIG M 
Columbia GDDAPRAVFPSIVGRPR-HTGVMVGMGQKDAYVGDEAQSKRGILTLKYPIEHGIVNNWDD 

unbiased TWRRFGITLSTMSFVLPLKNIRXLTEAPLNPKANREKMTQIMFETFNTPAMYVAIQAVLS 
biased   TWRRFGITLSTMSFVLPLKNIRXLTEAPLNPKANREKMTQIMFETFNTPAMYVAIQAVLS 
Columbia MEKIWHHTFYNELRVAPEEHPVLLTEAPLNPKANREKMTQIMFETFNTPAMYVAIQAVLS 
                *      * *      *************************************

unbiased LYASGRTTGQYITTFFLYRXSGDGVSHTVPIYEGYALPHAILRLDLAGRDLTDHLMKILT 
biased   LYASGRTTGQYITTFFLYRXSGDGVSHTVPIYEGYALPHAILRLDLAGRDLTDHLMKILT 
Columbia L-ASGRTTGG 1 VLDSGDGVSHTVPIYEGYALPHAILRLDLAGRDLTDHLMKILT 
         * *******           ****************************************

unbiased ERGYSFTTTAEREIVRDMKEKLSYIALDFEQELETSKTSSSVEKSFELPDGQVITIGAER 
biased   ERGYSFTTTAEREIVRDMKEKLSYIALDFEQELETSKTSSSVEKSFELPDGQVITIGAER 
Columbia ERGYSFTTTAEREIVRDMKEKLSYIALDFEQELETSKTSSSVEKSFELPDGQVITIGAER 
         ************************************************************

unbiased FRCPEVLFQPSMIGMENPGIHETTYNSIMKCDVDIRKDLYGNIVLSGGTTMFDGIGDRMS 
biased   FRCPEVLFQPSMIGMENPGIHETTYNSIMKCDVDIRKDLYGNIVLSGGTTMFDGIGDRMS 
Columbia FRCPEVLFQPSMIGMENPGIHETTYNSIMKCDVDIRKDLYGNIVLSGGTTMFGGIGDRMS 
         **************************************************** *******

Box A 

unbiased KEITALAPSSMKIKVVAPPERKYSVWIGGSIX VPNLQMWIAKAEYXNLDRQSSTG 
biased   KEITALAPSSMKIKVVAPPERKYSVWIGGSILAS XQMWIAKAEYXNLDRQSSTG 
Columbia KEITALAPSSMKIKVVAPPERKYSVWIGGSILASLSTFQQMQMWIAKAEY DESG 
         ******************************* ********* * 

unbiased SASDQKSPSKTRAVKILXNSSAVNFSTSYTLAIRLELSALIFLISLEIISSSIKWGMASS 
biased   SASDQKSPSKTRAVKILXNSSAVNFSTSYTLAIRLELSALIFLISLEIISSSIKWGMASS 
Columbia PS IVHRKCF 
         ** * 

unbiased SICNSSKLSMKKQSX 
biased   SICNSSKLSMKKQSX 
Columbia 
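The asterisk line used in the comparison above can be derived mechanically from aligned rows. The following minimal sketch assumes the three sequences have already been aligned and padded with dashes; the function name is hypothetical, and the alignment method itself is not addressed.

```python
def match_line(*aligned_rows):
    """Return a line with '*' wherever every aligned row has the same residue.

    Columns holding a gap ('-') or a blank in the first row are never marked,
    so a column of gaps does not count as a match.
    """
    width = max(len(row) for row in aligned_rows)
    padded = [row.ljust(width) for row in aligned_rows]
    marks = []
    for column in zip(*padded):
        first = column[0]
        same = first not in "- " and all(residue == first for residue in column)
        marks.append("*" if same else " ")
    return "".join(marks)


unbiased = "TTMFDGIGDRMS"
biased = "TTMFDGIGDRMS"
columbia = "TTMFGGIGDRMS"
print(match_line(unbiased, biased, columbia))  # prints "**** *******"
```

The example rows are the last twelve residues of the fifth alignment block, where the Columbia sequence has G in place of D.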